Concurrent tree crawler

I needed an application to crawl a web site containing archive issues of certain magazines and download selected articles.

To access the site, you had to log in through a web form. The site was organized as a tree: a magazine page linked to individual issues, and each issue page linked to individual articles. At each level, a single page listed only a fraction of the links to the level below (e.g. 50 links); you reached the rest by following the “next” and “previous” links (see the figure below).

Figure: Multipage nodes. Each node of the tree-like web site structure can correspond to many linked pages.
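
To make the idea of a multipage node concrete, here is a minimal sketch of gathering all child links of one node by following its “next” links. It works on a small in-memory model rather than on real HTML; the `Page` class, the `collect_child_links` function and the page names are hypothetical and are not part of the library's API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Page:
    """One page of a node: a slice of child links plus an optional "next" link."""
    child_links: List[str]
    next_page: Optional[str] = None


def collect_child_links(pages: Dict[str, Page], first_page: str) -> List[str]:
    """Follow the "next" links from first_page and gather every child link."""
    links: List[str] = []
    seen = set()  # guards against a "next" link that loops back
    current: Optional[str] = first_page
    while current is not None and current not in seen:
        seen.add(current)
        page = pages[current]
        links.extend(page.child_links)
        current = page.next_page
    return links


# Hypothetical magazine node whose issue links are split across two pages.
PAGES = {
    "magazine?page=1": Page(["issue-1", "issue-2"], next_page="magazine?page=2"),
    "magazine?page=2": Page(["issue-3"]),
}

print(collect_child_links(PAGES, "magazine?page=1"))
# prints ['issue-1', 'issue-2', 'issue-3']
```

In a real crawler, the lookup in `pages` would be replaced by downloading and parsing the page, but the loop over “next” links would stay the same.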

To explore such a site, the crawler needed built-in concurrency, so that, for example, one thread could download an article while the others explored the site in search of further articles. This was needed to speed up downloading of the site's contents.
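
The kind of concurrency meant here can be illustrated with a plain work queue shared by several threads. This is only a sketch under simplifying assumptions, not the algorithm used by the library: the site is a hypothetical in-memory dictionary (`TREE`), and "downloading" an article just records its name.

```python
import queue
import threading
from typing import Dict, List

# Hypothetical site: inner nodes map to their children, leaves are articles.
TREE: Dict[str, List[str]] = {
    "magazine": ["issue-1", "issue-2"],
    "issue-1": ["article-1a", "article-1b"],
    "issue-2": ["article-2a"],
}


def crawl(root: str, num_threads: int = 4) -> List[str]:
    """Explore the tree with several threads and return the articles found."""
    tasks: queue.Queue = queue.Queue()
    downloaded: List[str] = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            node = tasks.get()
            try:
                if node is None:  # sentinel: no more work
                    return
                children = TREE.get(node)
                if children is None:
                    # Leaf: stands in for downloading an article.
                    with lock:
                        downloaded.append(node)
                else:
                    # Inner node: hand its children over to any free thread.
                    for child in children:
                        tasks.put(child)
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    tasks.put(root)
    tasks.join()  # returns once every queued node has been processed
    for _ in threads:
        tasks.put(None)
    for thread in threads:
        thread.join()
    return sorted(downloaded)


print(crawl("magazine"))
# prints ['article-1a', 'article-1b', 'article-2a']
```

While one thread is busy with an article, the others keep taking nodes from the queue, which is the behaviour described above. The result is sorted because the order in which threads finish is not deterministic.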

I could not find a crawler that could be easily applied to this problem, so I decided to write one myself. After using it to download the data I needed, I decided to make it more general so that other people could use it.

The algorithmic heart of the program is a generic concurrent tree-crawling algorithm described here: https://github.com/mkobos/tree_crawler/wiki/algorithm/index.pdf. A web site crawler is built on this foundation. It can be extended in three main ways: 1) to crawl some general tree structure, 2) to crawl a web site with a tree-like structure, 3) to crawl a web site with a structure similar to my original magazine archive. It also offers some useful features, such as a schedule of program activity, logging, and handling of different kinds of errors.

  • The source of the library, which contains the implementation of the program along with a more detailed description, is available here: http://github.com/mkobos/tree_crawler. The Python package can also be downloaded and installed from the PyPI repository (e.g. by executing pip install --user concurrent_tree_crawler or easy_install --user concurrent_tree_crawler).
  • More detailed usage instructions and a description of the generic tree-crawling library are available here: http://github.com/mkobos/tree_crawler/wiki.