
I want to write a Python program that downloads the contents of a web page, and then downloads the contents of the web pages that the first page links to.



For example, this is the main web page: http://www.adobe.com/support/security/, and these are the pages I want to download: http://www.adobe.com/support/security/bulletins/apsb13-23.html and http://www.adobe.com/support/security/bulletins/apsb13-22.html



There is a certain condition I want to meet: it should download only web pages under bulletins, not the ones under advisories (e.g. http://www.adobe.com/support/security/advisories/apsa13-02.html).



#!/usr/bin/env python
import urllib
import re
import sys
import os

# Fetch the index page.
page = urllib.urlopen("http://www.adobe.com/support/security/")
page = page.read()

# Write every href target found on the page to the file 'content'.
fileHandle = open('content', 'w')
links = re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page)
for link in links:
    sys.stdout = fileHandle
    print('%s' % (link[0]))
    sys.stdout = sys.__stdout__
fileHandle.close()

# Keep only the bulletin links.
os.system("grep -i '/support/security/bulletins/' content >> content1")


I've already extracted the bulletin links into content1, but I don't know how to download the contents of those web pages by providing content1 as input.



The content1 file is as shown below:

/support/security/bulletins/apsb13-23.html
/support/security/bulletins/apsb13-23.html
/support/security/bulletins/apsb13-22.html
/support/security/bulletins/apsb13-22.html
/support/security/bulletins/apsb13-21.html
/support/security/bulletins/apsb13-21.html
/support/security/bulletins/apsb13-22.html
/support/security/bulletins/apsb13-22.html
/support/security/bulletins/apsb13-15.html
/support/security/bulletins/apsb13-15.html
/support/security/bulletins/apsb13-07.html
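
In other words, what I'm after is roughly this (an untested sketch; I'm assuming urllib.urlretrieve is the right call and that every line in content1 is a path relative to http://www.adobe.com):

import urllib

# Read the relative bulletin paths from content1, skip duplicates,
# and save each page under its file name (e.g. apsb13-23.html).
seen = set()
for line in open('content1'):
    path = line.strip()
    if not path or path in seen:
        continue
    seen.add(path)
    urllib.urlretrieve("http://www.adobe.com" + path, path.rsplit('/', 1)[-1])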



 Answers

If I understood your question, the following script should be what you want:



#!/usr/bin/env python

import urllib
import re
import sys
import os

# Fetch the index page and write every href target to the file 'content'.
page = urllib.urlopen("http://www.adobe.com/support/security/")
page = page.read()
fileHandle = open('content', 'w')
links = re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page)
for link in links:
    sys.stdout = fileHandle
    print('%s' % (link[0]))
    sys.stdout = sys.__stdout__
fileHandle.close()

# Keep only the bulletin links, de-duplicate them, turn them into absolute
# URLs in content1, then let wget download every URL listed in that file.
os.system("grep -i '/support/security/bulletins/' content 2>/dev/null | head -n 3 | uniq | sed -e 's/^/http:\/\/www.adobe.com/g' > content1")
os.system("wget -i content1")
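
The pipeline in the last two lines keeps only the bulletin links from content, de-duplicates them, prefixes each with http://www.adobe.com, and then wget -i downloads every URL listed in content1.

If you would rather stay in Python (or are on Python 3, where urllib.urlopen no longer exists), a rough equivalent sketch using urllib.request instead of grep/sed/wget could look like this; the regex and the first-three limit mirror the script above:

#!/usr/bin/env python3
import re
import urllib.request

BASE = "http://www.adobe.com"

# Fetch the index page and decode it to text.
page = urllib.request.urlopen(BASE + "/support/security/").read().decode("utf-8", "replace")

# Collect href targets, keep only bulletin pages, drop duplicates
# while preserving order, and take the first three (like head -n 3).
bulletins = []
for href, _text in re.findall(r"<a.*?\s*href=\"(.*?)\".*?>(.*?)</a>", page):
    if "/support/security/bulletins/" in href.lower() and href not in bulletins:
        bulletins.append(href)
bulletins = bulletins[:3]

# Download each bulletin to a file named after its last path component.
for href in bulletins:
    url = href if href.startswith("http") else BASE + href
    urllib.request.urlretrieve(url, url.rsplit("/", 1)[-1])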
