Please download the source files first, since this post is written for one specific URL only.

Before you start, install the BeautifulSoup library. I have described the installation in one of my previous posts, and you can also find instructions on many other websites.
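The library used in this post is the older BeautifulSoup 3. If you have pip, it can usually be installed with the following shell command:

pip install BeautifulSoup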

First, import the necessary libraries.

from BeautifulSoup import BeautifulSoup
import urllib2

Then open a file to write the content to.

f1 = open("Output.txt", "w")

Let’s open a website. This post is written only for the given URL.

soup = BeautifulSoup(urllib2.urlopen("Your url"))

You can also do this in two steps, as follows.

webContent = urllib2.urlopen("Your url")
soup = BeautifulSoup(webContent)

Now let’s extract the content of the site. For example, let’s extract a table and part of its content.

table1 = soup.findAll('table')[2]

This step extracts the table at index 2, i.e. the third table on the page, since soup.findAll returns a list.
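If you are not sure which index you need, a quick check (assuming soup has already been built as above) is to print how many tables the page contains:

print len(soup.findAll('table'))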

Let’s read the rows. I assume you know what a <tr> tag is in HTML.
rows = table1.findAll('tr')

In order to identify the content, let’s extract the title as well.

title = ''.join(rows[0].findAll('td')[0].findAll(text=True))+"\n"

Here, make sure to use two single quotation marks in ''.join. This line extracts the text inside the first <td></td> tag of the first row.
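If you want to check what was extracted before writing it, you can simply print it (an optional sanity check):

print title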
And now let’s write the title to the file.

f1.write("Table Name: %s" % title)

To extract the content we have to loop through the rows; here we loop through only the first 7 rows. Then we read all the columns, which are identified by the <td></td> tags.

for tr in rows[:7]:
    cols = tr.findAll('td')

The slice cols[8:12] reads the columns at indices 8 through 11 and loads their text into a variable named tblCont. On each pass through the loop, the content is written to the file.
    tblCont = ""
    for td in cols[8:12]:
        tblCont = tblCont + "\t" + td.find(text=True)
    f1.write("%s\n" % tblCont)

Finally, the file object has to be closed.

f1.close()
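For reference, here is a minimal sketch of the whole script put together. The URL is a placeholder you should replace with your own, and the table index, row count, and column slice are the same assumptions used in the steps above; they depend entirely on the page you are parsing.

from BeautifulSoup import BeautifulSoup
import urllib2

# Open the output file
f1 = open("Output.txt", "w")

# Fetch the page and build the soup
soup = BeautifulSoup(urllib2.urlopen("Your url"))

# Pick the third table on the page and its rows
table1 = soup.findAll('table')[2]
rows = table1.findAll('tr')

# Extract the title from the first cell of the first row
title = ''.join(rows[0].findAll('td')[0].findAll(text=True)) + "\n"
f1.write("Table Name: %s" % title)

# Loop through the first 7 rows and write the selected columns
for tr in rows[:7]:
    cols = tr.findAll('td')
    tblCont = ""
    for td in cols[8:12]:
        tblCont = tblCont + "\t" + td.find(text=True)
    f1.write("%s\n" % tblCont)

f1.close()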

Now save the file.

Press F5 to run it (if you are using IDLE).
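If you are not using IDLE, you can also run the script from a terminal (assuming you saved it as, say, parse_table.py):

python parse_table.py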

