Sunday, May 5, 2024
Asked Wednesday, September 21, 2022, 6:38:47

I'm trying to write a Bash script that will extract information from an HTML page (fetched with wget).
I know the information will be between <h*> tags, but is there a nice way to get it?



To be more precise, here is an example:




<h1>header1</h1>
<h2>header2</h2>
<h2>otherHeader2</h2>
<h1>lastHeader1</h1>
<h2>lastHeader2</h2>




I'd like to extract "otherHeader2", i.e. the second header (though it could be at any position) after header1.



 Answers

This is a simple Python script that finds the text of every header in your HTML, puts the values into a list, and prints it. You can write the rest of your script in Python, call this script from bash, or embed the code as a one-line snippet in your bash script. Check out the examples below.



test.html



<h1>header1</h1>
<h2>header2</h2>
<h2>otherHeader2</h2>
<h1>lastHeader1</h1>
<h2>lastHeader2</h2>


parse_header.py



#!/usr/bin/env python3
import re
import sys

# Print the text of every <h1>..<h6> element read from stdin, as a list.
print(re.findall(r'<h\d>(.*?)</h\d>', sys.stdin.read()))


The script can be called from bash (make it executable first with chmod +x parse_header.py):



cat test.html | ./parse_header.py


The Python code can also be put directly into the bash script:



cat test.html | python3 -c "import sys, re; print(re.findall(r'<h\d>(.*?)</h\d>', sys.stdin.read()))"


I believe the last option is not very readable in your case. It makes more sense when the code is so simple that it's not worth putting in its own script.
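To pick out one specific header, such as the second h2 after header1 from the question, something along these lines might work. This is only a sketch and not part of the script above: it swaps the regex for the standard library's html.parser so that tags with attributes are also handled, and the names HeaderCollector, nth_subheader and the file name nth_header.py are made up for illustration. It assumes the parent header's text is known exactly.

```python
#!/usr/bin/env python3
"""Print the n-th <h2> that follows a given <h1> in an HTML file."""
import sys
from html.parser import HTMLParser


class HeaderCollector(HTMLParser):
    """Collect (tag, text) pairs for every h1..h6 element, in document order."""

    HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headers = []      # finished (tag, text) pairs
        self._current = None   # header tag currently open, if any
        self._chunks = []      # text pieces seen inside the open header

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADER_TAGS:
            self._current = tag
            self._chunks = []

    def handle_data(self, data):
        if self._current is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            self.headers.append((tag, "".join(self._chunks).strip()))
            self._current = None


def nth_subheader(html, parent_text, n, parent_tag="h1", child_tag="h2"):
    """Return the n-th (1-based) child_tag header after the parent, or None."""
    collector = HeaderCollector()
    collector.feed(html)
    collector.close()
    inside = False
    count = 0
    for tag, text in collector.headers:
        if tag == parent_tag:
            # A new parent section starts; only the requested one counts.
            inside = (text == parent_text)
            count = 0
        elif inside and tag == child_tag:
            count += 1
            if count == n:
                return text
    return None


# Hypothetical command-line use: python3 nth_header.py test.html
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1], encoding="utf-8") as handle:
        result = nth_subheader(handle.read(), "header1", 2)
    if result is not None:
        print(result)
```

Saved as nth_header.py (a hypothetical name), running python3 nth_header.py test.html on the test.html above prints otherHeader2. From bash you can capture it with value=$(python3 nth_header.py test.html).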


Thursday, September 22, 2022
landarre
