I needed an application to crawl a web site containing archive issues of certain magazines and download selected articles.
To access the site, you had to log in using a web form. The web site was organized as a tree-like hierarchy: a magazine page consisted of many links to single issues, and each issue page consisted of links to single articles. On each level, a single page of links contained only a fraction (e.g. 50) of all the links to pages of the lower level. You could get more links by clicking the “next” or “previous” links (see the figure below).
A crawler for such a site needed built-in concurrency so that, for example, one thread could download an article while others explored the web site in search of more articles to download. This was needed to download the web site contents faster.
I was not able to find a crawler that could be easily applied to this problem, thus I decided to write one myself. After using it and downloading the data I needed, I decided to make it more general so other people could use it.
The algorithmic heart of the program is a generic concurrent tree-crawling algorithm described here: https://github.com/mkobos/tree_crawler/wiki/algorithm/index.pdf. A web site crawler is built on this foundation. It can be extended in 3 main ways: 1) to crawl some general tree structure, 2) to crawl a web site with a tree-like structure, 3) to crawl a web site with a structure similar to my original magazine archive. It also has some useful features, such as a schedule of program activity, logging, and handling of different kinds of errors.
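To give a rough feel for what crawling a tree concurrently means, here is a minimal sketch using only the Python standard library. It is not the algorithm from the PDF linked above and it does not use the library's API: the names TREE, crawl, get_children and process_leaf are made up for this illustration, and the in-memory dictionary merely stands in for the magazine, issue and article pages. Worker threads take nodes from a shared queue; inner nodes are expanded into their children, and leaves (the articles) are handed to a callback.

```python
import queue
import threading

# Hypothetical stand-in for the web site: magazine -> issues -> articles.
TREE = {
    "magazine": ["issue-1", "issue-2"],
    "issue-1": ["article-1a", "article-1b"],
    "issue-2": ["article-2a"],
}


def crawl(root, get_children, process_leaf, workers=4):
    """Visit every node reachable from root using several threads.

    get_children(node) returns a list of child nodes (empty for a leaf);
    process_leaf(node) is called once for every leaf.
    Returns a list of (node, exception) pairs for nodes that failed.
    """
    tasks = queue.Queue()
    errors = []
    errors_lock = threading.Lock()

    def worker():
        while True:
            node = tasks.get()
            try:
                if node is None:  # sentinel: no more work for this thread
                    return
                children = get_children(node)
                if children:
                    # Children are queued before the parent is marked done,
                    # so tasks.join() cannot return too early.
                    for child in children:
                        tasks.put(child)
                else:
                    process_leaf(node)
            except Exception as exc:  # record the failure and keep crawling
                with errors_lock:
                    errors.append((node, exc))
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()

    tasks.put(root)
    tasks.join()  # blocks until every queued node has been handled

    for _ in threads:  # tell each worker to stop
        tasks.put(None)
    for thread in threads:
        thread.join()
    return errors


downloaded = []
failures = crawl("magazine", lambda node: TREE.get(node, []), downloaded.append)
print(sorted(downloaded), failures)
```

In a real crawler, get_children would fetch a page and extract its links (following the “next” links as needed) and process_leaf would download the article. The order in which leaves are processed is not deterministic, which is why the example sorts the result before printing it.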
- The source of the library containing the implementation of the program, along with a more detailed description, is available here: http://github.com/mkobos/tree_crawler. The Python package can also be downloaded and installed from the PyPI repository (e.g. by executing pip install --user concurrent_tree_crawler or easy_install --user concurrent_tree_crawler).
- More detailed usage instructions and a description of the generic tree-crawling library are available here: http://github.com/mkobos/tree_crawler/wiki.