Intro
I needed to do some web crawling for a personal project. Instead of using any of the crawlers that are already available, I decided to write my own.
Python is a language I love to dabble in, so I decided to write my crawler with it. Python also has the awesome Beautiful Soup library for parsing HTML. I wrote the crawler using IronPython, but also ran it on standard Python 2.7 on Arch Linux. As a developer who mostly works on the .NET platform, I was glad to see that the IronPython tools for Visual Studio worked great and that IronPython played nicely with a third-party library (Beautiful Soup).
The crawler
I implemented the crawler as a command-line utility and used optparse to parse the arguments.
import optparse

#arguments:
# site: url of the page where the crawling should start
# file: name of the file where results are saved
# threshold: maximum number of pages that are crawled
# domain: base url of the domain that is to be crawled
def main():
    opt = optparse.OptionParser()
    opt.add_option('-s', '--site')
    opt.add_option('-f', '--file')
    opt.add_option('-t', '--threshold')
    opt.add_option('-d', '--domain')
    (options, args) = opt.parse_args()
    c = Crawlpy(options.file, options.site)
    c.threshold = options.threshold
    c.domain = options.domain
    c.run()
In the main function we just parse the options, create an instance of the Crawlpy class and start crawling.
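Assuming the script is saved as crawlpy.py (a placeholder name of my own, not necessarily what's in the repository), starting a crawl looks something like this:

python crawlpy.py -s http://example.com/blog/ -f results.csv -t 100 -d http://example.com

The threshold keeps the crawl finite, and the domain keeps it from wandering off to other sites.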
In addition to the command line arguments, we initialize a property called visited_urls in the constructor of the Crawlpy class. visited_urls is a list of dictionaries containing page title and URL information.
import urllib2
from BeautifulSoup import BeautifulSoup  #Beautiful Soup 3 import; with bs4 it would be: from bs4 import BeautifulSoup

class Crawlpy:
    def __init__(self, filename, site):
        self.visited_urls = []
        self.filename = filename
        self.site = site
        self.threshold = 0
        self.domain = ''
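Each entry in visited_urls ends up being a small dictionary. With made-up values, the list could look like this after a couple of pages:

#illustrative contents of visited_urls (the values are made up)
visited_urls = [
    {'title': 'Front page', 'url': 'http://example.com/'},
    {'title': 'About', 'url': 'http://example.com/about/'}
]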
The run()-method is displayed below:
    def run(self):
        base_address = self.site
        link_list = self.crawl(base_address, self.domain)
        for l in link_list:
            new_page = True
            for visited_url in self.visited_urls:
                if l['url'] == visited_url['url']:
                    new_page = False
            if new_page:
                print 'new page: ', l['url']
                self.visited_urls.append(l)
                link_list.extend(self.crawl(l['url'], self.domain))
                if int(self.threshold) > 0 and \
                   len(self.visited_urls) > int(self.threshold):
                    print 'threshold: ', self.threshold
                    break
        self.__write_to_csv(self.filename)
The crawl() method returns the initial list of dictionaries. Looping through that list, we check whether each url is a new one, meaning it is not already contained in visited_urls. A new page is appended to visited_urls and crawled for its content, and the links found there are added to the list we are looping over. When visited_urls reaches the desired threshold, we break out of the loop and write everything to a CSV file.
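The duplicate check scans the whole visited_urls list for every link, which gets slow on bigger sites. Keeping a set of already-seen URLs alongside the list would make the membership test constant-time; a minimal sketch of the idea, not something the crawler above does:

#sketch: constant-time duplicate check with a set (my own variation)
seen_urls = set()

def is_new(url):
    if url in seen_urls:
        return False
    seen_urls.add(url)
    return True

print is_new('http://example.com/')  # True
print is_new('http://example.com/')  # False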
crawl() is the most important method here. It takes url and domain as parameters: url is the URL that we want to parse, and domain is used to check that the links found point to the right domain.
    def crawl(self, url, domain):
        #return an empty list if opening the url fails
        try:
            response = urllib2.urlopen(url)
        except urllib2.HTTPError:
            return []
        #make sure url ends with '/' by stripping everything after the last '/'
        if url[len(url)-1] != '/':
            url = url[:url.rindex('/')+1]
        #get all anchor tags from the page
        html = response.read()
        soup = BeautifulSoup(html)
        links = soup.findAll('a')
        link_list = []
        for link in links:
            #skip links with rel='nofollow'
            if link.get('rel') == 'nofollow':
                continue
            href = link.get('href', 'empty')
            #remove everything after '#'
            if href.find('#') != -1:
                href = href[:href.index('#')]
            #remove everything after '?'
            if href.find('?') != -1:
                href = href[:href.index('?')]
            #turn relative links into absolute ones
            if not href.startswith('http') and not href.startswith('www'):
                if href.startswith('/'):
                    href = url + href[1:]
                elif href.startswith('..'):
                    href = url[0:url[0:-1].rindex('/')] + href[2:]
                elif href.startswith('.'):
                    href = url + href
            #only keep links that stay within the given domain
            if href.startswith(domain):
                link_dictionary = {'title': link.text,
                                   'url': href}
                link_list.append(link_dictionary)
        return link_list
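One gotcha worth mentioning: the rel comparison above matches Beautiful Soup 3, where get('rel') returns a plain string. Beautiful Soup 4 treats rel as a multi-valued attribute and returns a list, so with bs4 the check would need a small adjustment, roughly like this:

#Beautiful Soup 4 returns multi-valued attributes such as rel as lists
from bs4 import BeautifulSoup

link = BeautifulSoup('<a href="/x" rel="nofollow">x</a>', 'html.parser').a
print link.get('rel')  # [u'nofollow'] under bs4
if 'nofollow' in (link.get('rel') or []):
    print 'skipping nofollow link'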
The flow of the method is pretty simple:
- Try to open the url; if it fails, return an empty list
- Find all <a>-tags on the page
- Skip links with rel='nofollow'
- I don't want to get stuck in an abyss of some forum, so strip query strings from the url
- Screw the anchor links (#)
- If the link is relative, try to piece together something that makes sense (a sturdier approach is sketched below)
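Piecing relative URLs together by hand like I do above is fragile. A sturdier route would be the standard library's urlparse module, which handles most of the corner cases; a minimal sketch, not what the crawler currently uses:

#sketch: resolving relative links and dropping fragments with urlparse
from urlparse import urljoin, urldefrag

base = 'http://example.com/blog/post.html'
for href in ['../about/', './archive/', '/contact', 'comments#top']:
    absolute, fragment = urldefrag(urljoin(base, href))  # urldefrag strips the '#anchor' part
    print absolute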
__write_to_csv() is just a simple file-writing method that takes the filename as a parameter:
    def __write_to_csv(self, filename):
        output = open(filename, 'w')
        output.write('Title,URL\n')
        for url in self.visited_urls:
            output.write('{},{}\n'.format(url['title'], url['url']))
        output.close()
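One thing this naive formatting doesn't handle is a page title that contains a comma. The standard library's csv module takes care of the quoting; roughly how the method could look with it (a sketch, not the version above):

#sketch: writing the results with the csv module so commas in titles get quoted
import csv

def write_to_csv(filename, visited_urls):
    with open(filename, 'wb') as output:  # 'wb' avoids extra blank lines on Python 2 / Windows
        writer = csv.writer(output)
        writer.writerow(['Title', 'URL'])
        for page in visited_urls:
            #titles coming from Beautiful Soup are unicode, so encode before writing
            writer.writerow([page['title'].encode('utf-8'), page['url']])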
Final thoughts
I left the crawler in a really imperfect state. It did what I needed it to do, but it still has several flaws. For example, this crawler has no respect for robots.txt, and that should be addressed before setting it loose on someone's site.
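For the robots.txt part, Python 2 ships with the robotparser module, so a politer version could check each url before fetching it; a rough sketch of the idea (the user agent name is made up):

#sketch: honouring robots.txt with the standard library's robotparser
import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://example.com/robots.txt')
rp.read()

if rp.can_fetch('Crawlpy', 'http://example.com/some/page/'):
    print 'allowed to crawl'
else:
    print 'disallowed by robots.txt'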
The whole shebang can be found on BitBucket.
Software professional with a passion for quality. Likes TDD and working in agile teams. Has worked with a wide range of technologies, backend and frontend, from C++ to JavaScript. Currently very interested in functional programming.