Intro
I needed to do some web crawling for a personal project. Instead of using any of the crawlers that are already available, I decided to write my own.
Python is a language I love to dabble in, so I decided to write my crawler with it. Python also has the awesome Beautiful Soup library for parsing HTML. I wrote the crawler using IronPython, but also ran it on standard Python 2.7 on Arch Linux. As a developer who mostly works on the .NET platform, I was glad to see that the IronPython tools for Visual Studio worked great and that IronPython played nicely with a third-party library (Beautiful Soup).
The crawler
I implemented the crawler as a command-line utility and used optparse to parse the arguments.
import optparse

#arguments:
# site: url of the page where the crawling should start
# file: name of the file where results are saved
# threshold: maximum number of pages that are crawled
# domain: base url of the domain that is to be crawled
def main():
    opt = optparse.OptionParser()
    opt.add_option('-s', '--site')
    opt.add_option('-f', '--file')
    opt.add_option('-t', '--threshold')
    opt.add_option('-d', '--domain')
    (options, args) = opt.parse_args()
    c = Crawlpy(options.file, options.site)
    c.threshold = options.threshold
    c.domain = options.domain
    c.run()
In the main function we just parse the options, create an instance of the Crawlpy class and start crawling.
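Assuming the script is saved as crawlpy.py (a placeholder name of my own, not necessarily what's in the repository), starting a crawl looks something like this:

python crawlpy.py -s http://example.com/blog/ -f results.csv -t 100 -d http://example.com

The threshold keeps the crawl finite, and the domain keeps it from wandering off to other sites.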
In addition to the command line arguments, we initialize a property called visited_urls in the constructor of the Crawlpy class. visited_urls is a list of dictionaries containing page title and URL information.
import urllib2
from BeautifulSoup import BeautifulSoup  #Beautiful Soup 3 import; with bs4 it would be: from bs4 import BeautifulSoup

class Crawlpy:
    def __init__(self, filename, site):
        self.visited_urls = []
        self.filename = filename
        self.site = site
        self.threshold = 0
        self.domain = ''
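Each entry in visited_urls ends up being a small dictionary. With made-up values, the list could look like this after a couple of pages:

#illustrative contents of visited_urls (the values are made up)
visited_urls = [
    {'title': 'Front page', 'url': 'http://example.com/'},
    {'title': 'About', 'url': 'http://example.com/about/'}
]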
The run()-method is displayed below:
    def run(self):
        base_address = self.site
        link_list = self.crawl(base_address, self.domain)
        for l in link_list:
            new_page = True
            for visited_url in self.visited_urls:
                if l['url'] == visited_url['url']:
                    new_page = False
            if new_page:
                print 'new page: ', l['url']
                self.visited_urls.append(l)
                link_list.extend(self.crawl(l['url'], self.domain))
                if int(self.threshold) > 0 and \
                   len(self.visited_urls) > int(self.threshold):
                    print 'threshold: ', self.threshold
                    break
        self.__write_to_csv(self.filename)
The crawl() method returns the initial list of dictionaries. Looping through that list, we check whether each url is a new one, meaning it is not already contained in visited_urls. A new page is appended to visited_urls and crawled for its content, and the links found there are added to the list we are looping over. When visited_urls reaches the desired threshold, we break out of the loop and write everything to a CSV file.
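The duplicate check scans the whole visited_urls list for every link, which gets slow on bigger sites. Keeping a set of already-seen URLs alongside the list would make the membership test constant-time; a minimal sketch of the idea, not something the crawler above does:

#sketch: constant-time duplicate check with a set (my own variation)
seen_urls = set()

def is_new(url):
    if url in seen_urls:
        return False
    seen_urls.add(url)
    return True

print is_new('http://example.com/')  # True
print is_new('http://example.com/')  # False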
crawl() is the most important method here. It takes url and domain as parameters: url is the URL that we want to parse, and domain is used to check that the links found point to the right domain.
    def crawl(self, url, domain):
        #return an empty list if opening the url fails
        try:
            response = urllib2.urlopen(url)
        except urllib2.HTTPError:
            return []
        #make sure url ends with '/' by stripping everything after the last '/'
        if url[len(url)-1] != '/':
            url = url[:url.rindex('/')+1]
        #get all anchor tags from the page
        html = response.read()
        soup = BeautifulSoup(html)
        links = soup.findAll('a')
        link_list = []
        for link in links:
            #skip links with rel='nofollow'
            if link.get('rel') == 'nofollow':
                continue
            href = link.get('href', 'empty')
            #remove everything after '#'
            if href.find('#') != -1:
                href = href[:href.index('#')]
            #remove everything after '?'
            if href.find('?') != -1:
                href = href[:href.index('?')]
            #turn relative links into absolute ones
            if not href.startswith('http') and not href.startswith('www'):
                if href.startswith('/'):
                    href = url + href[1:]
                elif href.startswith('..'):
                    href = url[0:url[0:-1].rindex('/')] + href[2:]
                elif href.startswith('.'):
                    href = url + href
            #only keep links that stay within the given domain
            if href.startswith(domain):
                link_dictionary = {'title': link.text,
                                   'url': href}
                link_list.append(link_dictionary)
        return link_list
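One gotcha worth mentioning: the rel comparison above matches Beautiful Soup 3, where get('rel') returns a plain string. Beautiful Soup 4 treats rel as a multi-valued attribute and returns a list, so with bs4 the check would need a small adjustment, roughly like this:

#Beautiful Soup 4 returns multi-valued attributes such as rel as lists
from bs4 import BeautifulSoup

link = BeautifulSoup('<a href="/x" rel="nofollow">x</a>', 'html.parser').a
print link.get('rel')  # [u'nofollow'] under bs4
if 'nofollow' in (link.get('rel') or []):
    print 'skipping nofollow link'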
The flow of the method is pretty simple:
- Try to open the url; if it fails, return an empty list
- Find all <a>-tags on the page
- Skip links with rel='nofollow'
- I don't want to get stuck in an abyss of some forum, so strip query strings from the url
- Screw the anchor links (#)
- If the link is relative, try to piece together something that makes sense (a sturdier approach is sketched below)
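Piecing relative URLs together by hand like I do above is fragile. A sturdier route would be the standard library's urlparse module, which handles most of the corner cases; a minimal sketch, not what the crawler currently uses:

#sketch: resolving relative links and dropping fragments with urlparse
from urlparse import urljoin, urldefrag

base = 'http://example.com/blog/post.html'
for href in ['../about/', './archive/', '/contact', 'comments#top']:
    absolute, fragment = urldefrag(urljoin(base, href))  # urldefrag strips the '#anchor' part
    print absolute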
__write_to_csv() is just a simple file-writing method that takes the filename as a parameter:
    def __write_to_csv(self, filename):
        output = open(filename, 'w')
        output.write('Title,URL\n')
        for url in self.visited_urls:
            output.write('{},{}\n'.format(url['title'], url['url']))
        output.close()
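One thing this naive formatting doesn't handle is a page title that contains a comma. The standard library's csv module takes care of the quoting; roughly how the method could look with it (a sketch, not the version above):

#sketch: writing the results with the csv module so commas in titles get quoted
import csv

def write_to_csv(filename, visited_urls):
    with open(filename, 'wb') as output:  # 'wb' avoids extra blank lines on Python 2 / Windows
        writer = csv.writer(output)
        writer.writerow(['Title', 'URL'])
        for page in visited_urls:
            #titles coming from Beautiful Soup are unicode, so encode before writing
            writer.writerow([page['title'].encode('utf-8'), page['url']])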
Final thoughts
I left the crawler in a really imperfect state. It did what I needed it to do, but it still has several flaws. For example, this crawler has no respect for robots.txt, and that should be addressed before setting it loose on someone's site.
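For the robots.txt part, Python 2 ships with the robotparser module, so a politer version could check each url before fetching it; a rough sketch of the idea (the user agent name is made up):

#sketch: honouring robots.txt with the standard library's robotparser
import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('http://example.com/robots.txt')
rp.read()

if rp.can_fetch('Crawlpy', 'http://example.com/some/page/'):
    print 'allowed to crawl'
else:
    print 'disallowed by robots.txt'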
The whole shebang can be found on BitBucket.
Software professional with a passion for quality. Likes TDD and working in agile teams. Has worked with a wide range of technologies, backend and frontend, from C++ to JavaScript. Currently very interested in functional programming.