Question

Parsing an HTML page with Bash

rated 0 times [ 1] [ 0] / answers: 1 / hits: 11373 / 2 Years ago, wed, september 21, 2022, 6:38:47

I'm trying to write a Bash script that will extract information from an HTML page (fetched with wget).
I know the information will be between <h*> tags, but is there a nice way to get it?

To be more precise, here is an example:

<h1>header1</h1>

<h2>header2</h2>

<h2>otherHeader2</h2>

<h1>lastHeader1</h1>

<h2>lastHeader2</h2>

I'd like to extract "otherHeader2", i.e. the second header (though it could be at any position) after header1.

Answers

landarre

answered 2 Years ago aveerakfas · Accepted Answer

This is a simple Python script that parses your HTML, collects all the header values into a list, and prints it. You can either write the rest of your script in Python, call this script from Bash, or embed the short code as a snippet in your Bash script. See the examples below.

test.html

<h1>header1</h1>

<h2>header2</h2>

<h2>otherHeader2</h2>

<h1>lastHeader1</h1>

<h2>lastHeader2</h2>

parse_header.py

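#!/usr/bin/env python3

import re
import sys

# Print the text of every <h1>..<h6> header read from stdin, as a list.
# Assumes each header opens and closes on a single line and has no attributes.
print(re.findall(r'<h\d>(.*?)</h\d>', sys.stdin.read()))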

The script can be called from Bash (after making it executable with chmod +x parse_header.py):

cat test.html | ./parse_header.py

The Python code can also be put directly into a Bash script:

cat test.html | python3 -c "import sys, re; print(re.findall(r'<h\d>(.*?)</h\d>', sys.stdin.read()))"

I believe the last option is not very readable in your case. It makes more sense when you have some simple code that isn't worth putting in its own script.
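The script above prints every header, while the question asks for one specific <h2> under a given <h1>. A minimal sketch of that selection step follows. It assumes the same simple markup as test.html (no attributes on the tags, each header on one line). The function name nth_h2_after and the SAMPLE string are made up for illustration and are not part of the original answer.

```python
import re

# Illustrative input, copied from test.html in the answer above.
SAMPLE = """<h1>header1</h1>
<h2>header2</h2>
<h2>otherHeader2</h2>
<h1>lastHeader1</h1>
<h2>lastHeader2</h2>
"""


def nth_h2_after(html, h1_text, index):
    """Return the index-th (0-based) <h2> text found under the <h1> whose
    text equals h1_text, or None if there is no such header."""
    in_section = False
    found = []
    # Each match is a (level, text) tuple; \1 forces the closing tag to
    # have the same level as the opening tag.
    for level, text in re.findall(r'<h([1-6])>(.*?)</h\1>', html):
        if level == '1':
            # A new <h1> starts a new section; track whether it is ours.
            in_section = (text == h1_text)
        elif level == '2' and in_section:
            found.append(text)
    return found[index] if index < len(found) else None


print(nth_h2_after(SAMPLE, "header1", 1))  # prints: otherHeader2
```

To use this in the Bash pipeline shown earlier, SAMPLE could be replaced with sys.stdin.read() (plus import sys). For real-world pages with attributes or nested markup, a proper HTML parser would likely be more reliable than a regular expression.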