
I want to write a Python program that downloads the contents of a web page, and then downloads the contents of the web pages that the first page links to.



For example, this is the main web page: http://www.adobe.com/support/security/, and these are the pages I want to download: http://www.adobe.com/support/security/bulletins/apsb13-23.html and http://www.adobe.com/support/security/bulletins/apsb13-22.html



There is a certain condition I want to meet: it should download only web pages under bulletins, not the ones under advisories (e.g. http://www.adobe.com/support/security/advisories/apsa13-02.html).



#!/usr/bin/env python
import urllib
import re
import sys
import os

# Fetch the index page.
page = urllib.urlopen("http://www.adobe.com/support/security/")
page = page.read()

# Write every href target found on the page to the file 'content'.
fileHandle = open('content', 'w')
links = re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page)
for link in links:
    sys.stdout = fileHandle
    print('%s' % (link[0]))
    sys.stdout = sys.__stdout__
fileHandle.close()

# Keep only the bulletin links.
os.system("grep -i '/support/security/bulletins/' content >> content1")


I've already extracted the bulletin links into content1, but I don't know how to download the contents of those web pages by providing content1 as input.



The content1 file is as shown below:

/support/security/bulletins/apsb13-23.html
/support/security/bulletins/apsb13-23.html
/support/security/bulletins/apsb13-22.html
/support/security/bulletins/apsb13-22.html
/support/security/bulletins/apsb13-21.html
/support/security/bulletins/apsb13-21.html
/support/security/bulletins/apsb13-22.html
/support/security/bulletins/apsb13-22.html
/support/security/bulletins/apsb13-15.html
/support/security/bulletins/apsb13-15.html
/support/security/bulletins/apsb13-07.html
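
In other words, what I'm after is roughly this (an untested sketch; I'm assuming urllib.urlretrieve is the right call and that every line in content1 is a path relative to http://www.adobe.com):

import urllib

# Read the relative bulletin paths from content1, skip duplicates,
# and save each page under its file name (e.g. apsb13-23.html).
seen = set()
for line in open('content1'):
    path = line.strip()
    if not path or path in seen:
        continue
    seen.add(path)
    urllib.urlretrieve("http://www.adobe.com" + path, path.rsplit('/', 1)[-1])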



 Answers

If I understood your question, the following script should be what you want:



#!/usr/bin/env python

import urllib
import re
import sys
import os

# Fetch the index page and write every href target to the file 'content'.
page = urllib.urlopen("http://www.adobe.com/support/security/")
page = page.read()
fileHandle = open('content', 'w')
links = re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page)
for link in links:
    sys.stdout = fileHandle
    print('%s' % (link[0]))
    sys.stdout = sys.__stdout__
fileHandle.close()

# Keep only the bulletin links, de-duplicate them, turn them into absolute
# URLs in content1, then let wget download every URL listed in that file.
os.system("grep -i '/support/security/bulletins/' content 2>/dev/null | head -n 3 | uniq | sed -e 's/^/http:\/\/www.adobe.com/g' > content1")
os.system("wget -i content1")
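
The pipeline in the last two lines keeps only the bulletin links from content, de-duplicates them, prefixes each with http://www.adobe.com, and then wget -i downloads every URL listed in content1.

If you would rather stay in Python (or are on Python 3, where urllib.urlopen no longer exists), a rough equivalent sketch using urllib.request instead of grep/sed/wget could look like this; the regex and the first-three limit mirror the script above:

#!/usr/bin/env python3
import re
import urllib.request

BASE = "http://www.adobe.com"

# Fetch the index page and decode it to text.
page = urllib.request.urlopen(BASE + "/support/security/").read().decode("utf-8", "replace")

# Collect href targets, keep only bulletin pages, drop duplicates
# while preserving order, and take the first three (like head -n 3).
bulletins = []
for href, _text in re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page):
    if "/support/security/bulletins/" in href.lower() and href not in bulletins:
        bulletins.append(href)
bulletins = bulletins[:3]

# Download each bulletin to a file named after its last path component.
for href in bulletins:
    url = href if href.startswith("http") else BASE + href
    urllib.request.urlretrieve(url, url.rsplit("/", 1)[-1])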
