Tuesday, May 6, 2014

spyder - Indexer and Scraper Runner

Spider by Josch13 (CC0)
Scraping data is relatively straightforward. You simply dig into some source, extract what you need, and store it. I went through the basics of scraping using Node.js earlier. Even though the task itself is simple, once you need to scrape a lot you end up with plenty of boilerplate and repetition.

To deal with this issue in a recent project, I ended up writing a little framework that makes it easier to write scrapers. spyder provides structure for both writing and running your scrapers. It relies on two basic concepts: indexing and scraping.

In the first pass, indexing, it executes an indexer whose purpose is to extract the links to scrape. In the second pass, scraping, it makes sure each link is scraped and then helps you deal with the results. You can also control the delay between individual scrapes to avoid hitting the server too hard. I'm not sure if that actually helps measurably, but I guess it doesn't hurt.
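
To make the two passes concrete, here is a minimal sketch of the idea in plain Node.js. It is not spyder's actual API: the names `runTwoPass`, `indexer`, `scraper`, `onResult` and `delay` are hypothetical and picked just for illustration, and the indexer and scraper are stubs that never touch the network.

```javascript
// Hypothetical two-pass runner, for illustration only (not spyder's API).
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function runTwoPass({ indexer, scraper, onResult, delay = 0 }) {
  // Pass 1: the indexer returns the list of targets (e.g. links) to scrape.
  const targets = await indexer();
  const results = [];

  // Pass 2: scrape each target in turn, pausing between scrapes.
  for (let i = 0; i < targets.length; i++) {
    if (i > 0 && delay > 0) {
      await sleep(delay);
    }
    const result = await scraper(targets[i]);
    await onResult(targets[i], result);
    results.push(result);
  }

  return results;
}

// Usage with stubbed functions; a real indexer and scraper would fetch
// and parse pages instead of returning canned values.
runTwoPass({
  indexer: async () => ['/page/1', '/page/2'],
  scraper: async (link) => ({ link, title: 'Title of ' + link }),
  onResult: (link, result) => console.log('scraped', link, result),
  delay: 10, // milliseconds between individual scrapes
});
```

Scraping sequentially keeps the delay meaningful: only one request is in flight at a time, so the pause really does space them out.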

Even though I have used the tool for scraping the web, that doesn't mean you cannot use it for other things. I can imagine it being useful for extracting insights out of a filesystem. For instance, you could set it up to run against a directory of images and analyze them.

If you are into indexing and scraping, check it out. I'm open to improvement requests.