Anyone who has done development on *nix-like system knows one of the most important tools is a powerful shell. Windows had to suffer from weak command line tool until PowerShell was introduced few years back.

I like to do as little alt-tabbing as possible, so I like to have my tools available in Visual Studio. PowerConsole is a great add-in that embeds PowerShell into VS 2010.

Installation is just couple of clicks, after that you can find it in View > Other Windows > Power Console

The variable $PowerConsoleUserProfile will tell you the location of profile file. Profile-file is just a PowerShell script that executes on start up. It is good for setting aliases and loading often used libraries etc.


Intro

I needed to do some web crawling for a personal project. Instead of using any of the crawlers that are available I decided to write my own.

Python is a language I love to dabble on so I decided to write my crawler with that. Python also has awesome Beautiful Soup –library for parsing html. I wrote it using IronPython, but also ran it on standard Python 2.7 on Arch Linux. As a developer who mostly works on .NET-platform, I was glad to see that IronPython tools for Visual Studio worked great and that IronPython worked great with third party library (Beautiful Soup).

The crawler

I implemented the crawler as command line utility. I used optparse to parse the arguments.

#arguments:
# site: url of the page where the crawling should start
# file: name of the file where results are saved
# threshold: maximum number of pages that are crawled
# domain: Base url of the domain that is to be crawled
def main():
    opt=optparse.OptionParser()
    opt.add_option('-s', '--site')
    opt.add_option('-f', '--file')
    opt.add_option('-t', '--threshold')
    opt.add_option('-d', '--domain')
    (options, args)=opt.parse_args()
    c=Crawlpy(options.file, options.site)
    c.threshold=options.threshold
    c.domain=options.domain
    c.run()

In the main function we just parse the options, create instance of the Crawlpy-class and start crawling.

In addition of to command line arguments, we initiate property called visited_urls in the constructor of Crawlpy-class. visited_urls is a array of dictionaries containing page title and url information.

class Crawlpy:
    def __init__(self, filename, site):
        self.visited_urls=[]
        self.filename=filename
        self.site=site
        self.threshold=0
        self.domain=''

The run()-method is displayed below:

def run(self):
        base_address=self.site
        link_list=self.crawl(base_address, self.domain)

        for l in link_list:
            new_page = True
            for visited_url in self.visited_urls:
                if l['url'] == visited_url['url']:
                    new_page = False
            if new_page:
                print 'new page: ',l['url']
                self.visited_urls.append(l)
                link_list.extend(self.crawl(l['url'],self.domain))

            if int(self.threshold) > 0 and \
                    int(len(self.visited_urls)) > int(self.threshold):
                print 'threshold: ',self.threshold
                break

        self._Crawlpy__write_to_csv(self.filename)

The crawl()-method returns the initial dictionary array. Looping through that array, we check url is a new one, meaning it is not contained in the visited_urls-array. New page is added to the visited_urls and crawled for it’s content. When visited_urls reaches desired threshold we break out of the loop and write it all to a csv-file.

crawl() is the most important method here. It takes url and domain as parameters. url is url that we want to parse and domain is used to check the url points to the right domain.

 def crawl(self, url, domain):
        #return empty array if opening url fails
        try:
            response = urllib2.urlopen(url)
        except urllib2.HTTPError:
            return []

        #add '/' to the end of url
        if url[len(url)-1] != '/':
            url = url[:url.rindex('/')+1]

        #get all anchor-tags from the page
        html = response.read()
        soup = BeautifulSoup(html)
        links = soup.findAll('a')
        link_list = []
        for link in links:
            #skip links with rel='nofollow'
            if link.get('rel')=='nofollow':
                continue

            href = link.get('href','empty')
            #remove everything after '#'
            if href.find('#')!=-1:
                href=href[:href.index('#')]
            if href.find('?')!=-1:
                href=href[:href.index('?')]

            if not href.startswith('http') and not href.startswith('www'):
                if href.startswith('/'):
                    href=url+href[1:]
                elif href.startswith('..'):
                    href=url[0:url[0:-1].rindex('/')]+href[2:]
                elif href.startswith('.'):
                    href=url+href

            if href.startswith(domain):
                link_dictionary={'title':link.text,
                                'url':href }
                link_list.append(link_dictionary)
        return link_list

The flow of the method is pretty simple:

  • Try to open the url, if it fails return empty array
  • Find all <a>-tags on the page
  • Skip links with ‘nofollow’
  • Don’t want to get stuck in an abyss of some forum, so strip querystrings from url
  • Screw the anchor links (#)
  • If the link is relative, try to piece together something that makes sense

__write_to_csv() is just a simple file writing method, that takes filename as a parameter:

def __write_to_csv(self, filename):
        file=open(filename, 'w')
        file.write('Title,URL\n')
        for url in self.visited_urls:
            file.write('{},{}\n'.format(url['title'],url['url']))
        file.close

Final thoughts

I left the crawler in a really imperfect state. It did what I needed it to do, but there are several flaws still. For example, this crawler has no respect for robots.txt and that should be considered before setting it loose on someones site.

The whole shebang can be found on BitBucket.


This is second part of a series. Check out the first part if you haven’t already. The code can be found at BitBucket repo.

Last time we looked at data retrieval using Active Record –pattern. In this part we’ll do data persistence and do little bit of refactoring.

Saving and deleting

If we have made changes to a record, or created a new one, calling Save()-method persists changes to the data store.

   1: public void Save()

   2: {

   3:     using (var ctx = new DataClassesDataContext())

   4:     {

   5:         var task=ctx.TaskTables.FirstOrDefault(x=>x.id==this.id);

   6:

   7:         //create new record if id is not found

   8:         if (task == null)

   9:         {

  10:             task = new TaskTable();

  11:             ctx.TaskTables.InsertOnSubmit(task);

  12:         }

  13:

  14:         task.title = this.title;

  15:         task.content = this.content;

  16:         task.completed = this.completed;

  17:         ctx.SubmitChanges();

  18:         this.id = task.id;

  19:     }

  20: }

 

bt_code_init(’4ef5e2c5-1499-43cd-9a57-3e1a142d6edc’);

Unlike data retrieval methods, Save() doesn’t need to be declared static, because it is only called on an instance of an object. It handles both update and creation. Line 6 uses lambda expression to find existing record in the data store. If you’re not familiar with lambda expression, you can find more about it here, for example.

Delete() is very straightforward.

   1: public void Delete()

   2: {

   3:     using (var ctx = new DataClassesDataContext())

   4:     {

   5:         ctx.TaskTables.DeleteOnSubmit(

   6:                ctx.TaskTables.SingleOrDefault(x => x.id == this.id));

   7:         ctx.SubmitChanges();

   8:

   9:     }

  10: }

 

You should note that this only removes the row from the database, the active record -object still exists.

Refactoring

As we saw in the last part, our two data retrieval methods share quite a bit of code. The difference was actually just one line. This is not very good. It means that if we have to make change to the way we retrieve data, we would have to remember make same changes in two different places. If we want to add new query capabilities like FetchByTitle, we would have to duplicate code for that too. In order to fixing that duplication, we refactor the class and isolate common functionality.

Isolation of common functionality is achieved with QueryTasks()-method displayed below.

   1: private static List<Task> QueryTasks(Expression<Func<Task,bool>> predicate)

   2: {

   3:     var taskList = new List<Task>();

   4:     using (var ctx = new DataClassesDataContext())

   5:     {

   6:         var query = from task in ctx.TaskTables

   7:                     select new Task

   8:                     {

   9:                         id = task.id,

  10:                         title = task.title,

  11:                         content = task.content,

  12:                         completed=task.completed

  13:                     };

  14:         query.Where(predicate);

  15:         taskList = query.ToList();

  16:     }

  17:     return taskList;

  18: }

It is declared private, since it is only going to be used internally by the class. The method looks quite familiar, but this method takes a lambda expression as parameter and applies it as where-clause of the query (line 14). Our data retrieval methods can now be reduced to this:

   1: public static List<Task> FetchAll()

   2: {

   3:     return QueryTasks(x => true);

   4: }

   5:

   6: public static Task FetchByID(int id)

   7: {

   8:     return QueryTasks(x => x.id == id).FirstOrDefault();

   9: }

Conclusion

At this point we have basic implementation for data access based on active record pattern. Active record can be a nice and simple way to access data without adding too many abstraction layers in an application. Active record introduces tight coupling between data model and object model. Sometimes that is not what you want and you may need to look into alternative data access patterns such as Data Mapper or Repository.


Introduction

I decided to try and write a series about different data access methods. Main motivation being to learn them better. The first one in the series is Active Record. I am mostly a .NET-guy these days and I really want to get to know linq better, so I will be using Linq to SQL and C#.

Martin Fowler describes Active Record -pattern in his book Patterns of Enterprise Application Architecture:

An object that wraps a row in a database table or view, encapsulates the database access, and adds domain logic to the data

This means that for every database table or view a class will be created. Columns of the database table will be the properties of the corresponding class. CRUD-functionality is provided by methods of the class. If there is additional business logic involved in data persisting it will be implemented internally in the active record class.

With Active Record your tables are essentially your domain objects, so it is a good data access pattern if your database has a simple enough structure. If your database structure is going to be complicated or doesn’t map well to object model you might want to consider some other strategy.

Design

We will be designing a simple todo-list application. Using SQL Server database we create a database table called “Tasks”.

image

Picture above displays structure of the Tasks-table. The structure is very simple:

  • id – Incremental id for task
  • title – Name of the task
  • content – Additional notes for the task
  • completed – Boolean value indicating whether the task is completed or not

According to active record -pattern the the database columns will become properties of the corresponding class.

image

The class also needs methods for performing data access:

  • FetchAll() – Return collection of Task-objects
  • FetchById() – Returns a single Task-object
  • Save() – Persists object to data store
  • Delete() – Deletes the record from data store

Implementation

First step in implementation is setting up SQL Server Database and creating the Tasks-table. After that create a new C#-project in Visual Studio and from Add – New Item –dialog choose “Linq to SQL classes”. After that you can just drag and drop database tables on the design surface. For more detailed description check ScottGu’s blog

image

I’ve changed the default name of the Linq to SQL -class from Task to TaskTable, because Task will be the name of the Active Record -class.

Data Retrieval

Like I said earlier, each column in the table becomes a property of the class.

   1: public class Task

   2: {

   3:     public int id { get; set; }

   4:     public string title { get; set; }

   5:     public string content { get; set; }

   6:     public bool completed { get; set; }

   7: }

Then we can start adding the functionality we specified. Let’s start with data retrieval. FetchAll() returns all rows in the datastore.

   1: public static List<Task> FetchAll()

   2: {

   3:     List<Task> lst = new List<Task>();

   4:     using (var ctx = new DataClassesDataContext())

   5:     {

   6:         var query = from task in ctx.TaskTables

   7:                     select new Task

   8:                     {

   9:                         id = task.id,

  10:                         title = task.title,

  11:                         content = task.content,

  12:                         completed = task.completed

  13:                     };

  14:         lst = query.ToList();

  15:     }

  16:     return lst;

  17: }

The method needs to be static, because we must be able to call it before we have created any instances of the Task-class. Query is a Linq query that simply selects the appropriate columns into Task-objects. The query is executed when we call ToList(). We could use the method like this:

var tasks = Task.FetchAll();

We will also want to be able to get a hold of a single Task-record. That’s where we’ll need FetchByID()-method:

   1: public static Task FetchByID(int id)

   2: {

   3:     Task t = new Task();

   4:     using (var ctx = new DataClassesDataContext())

   5:     {

   6:         var query = from task in ctx.TaskTables

   7:                     where task.id==id

   8:                     select new Task

   9:                     {

  10:                         id = task.id,

  11:                         title = task.title,

  12:                         content = task.content,

  13:                         completed=task.completed

  14:                     };

  15:         t = query.First();

  16:     }

  17:     return t;

  18: }

It’s the same as FetchAll() but with added where-clause to specify a single record we want to fetch. This is severely breaking the DRY-principle, and we’ll probably want to refactor this later.

Conclusion

The codes can be found in Bitbucket repo.

As seen in this post, implementing data retrieval with active record can be quite simple. In the next part we’ll implement about create-, update- and delete -functionality and do some refactoring.




Follow

Get every new post delivered to your Inbox.