Saturday, March 16, 2013

Scraping the Web Using Node.js

The Dregs by jazzijava (CC BY-NC-ND)
An important part of a data analyst's work is gathering data. Sometimes you get it in a nice, machine-readable format (XML, JSON, CSV, you name it). At other times you have to work a little to get the data into a decent format.

Node.js + Cheerio + Request - a Great Combo

As it happens, Node.js and its associated tooling are a great fit for this purpose. You get to use a familiar, jQuery-style query syntax, and there is plenty of tooling available.

Disfigured by scabeater (CC BY-NC-ND)

My absolute favorite used to be Zombie.js. Although it is designed mainly for testing, it often works well enough for scraping. node.io is another good alternative. In one case I had to use a combination of request, htmlparser and soupselect, as zombie just didn't bite there.

These days I like to use a combination of cheerio and request. Getting this combo to work in various environments is easier than with Zombie. In addition, you get to use the familiar jQuery syntax, which is a big bonus.

Basic Workflow

When it comes to scraping, the basic workflow is quite simple. During development it can be useful to stub out functionality and fill it in as you progress. Here is the rough approach I use:

  1. Figure out how the data is structured currently
  2. Come up with selectors to access it
  3. Map the data into some new structure
  4. Serialize the data into some machine-readable format based on your needs
  5. Serve the data through a web interface if you want to
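
Steps 3 and 4 above can be sketched without any scraping library. This is only a rough illustration: the raw field names (title, price), the comma-decimal price format and the helper names are assumptions made for the example, not something taken from the scrapers discussed in this post.

```javascript
// Step 3: map a loosely structured scraped row into a cleaner shape.
// The input fields `title` and `price` are hypothetical.
function toRecord(raw) {
  return {
    name: raw.title.trim(),
    // Accept "8,50 €" style prices by normalizing the decimal comma.
    price: Number.parseFloat(raw.price.replace(',', '.')),
  };
}

// Step 4a: serialize as pretty-printed JSON.
function toJSON(records) {
  return JSON.stringify(records, null, 2);
}

// Quote a CSV field only if it contains a quote, comma or newline.
function csvField(value) {
  const text = String(value);
  return /[",\n]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text;
}

// Step 4b: serialize as CSV with a fixed header row.
function toCSV(records) {
  const header = ['name', 'price'];
  const rows = records.map((record) =>
    header.map((key) => csvField(record[key])).join(',')
  );
  return [header.join(','), ...rows].join('\n');
}
```

Keeping the mapping and the serialization as separate small functions makes it easy to stub one of them out while you are still working on the selectors.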

It can be helpful to know how to use Chrome Developer Tools or Firebug effectively. The SelectorGadget bookmarklet may come in handy too. If you feel like it, play around with jQuery selectors in your browser. Being able to compose selectors effectively is very useful.

Examples

Shady Customer by Petur (CC BY-NC-ND)
sonaatti-scraper scrapes some restaurant data. It uses node.io, comes with a small CLI tool and makes it possible to serve the data through a web interface.

There is some room for improvement. For instance, it would be a good idea not to scrape the data each time the web API is queried. There should be a cache of some sort to avoid unnecessary polling. It is a good starting point, though, given its simplicity.
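
One way to avoid scraping on every query is a small time-based cache in front of the scrape function. The sketch below is an assumption about how that could look, not what sonaatti-scraper does. The names cached, scrape, ttlMs and now are made up for the example.

```javascript
// Wrap a scrape function so its result is reused for `ttlMs` milliseconds.
// `scrape` may return a plain value or a promise; `now` is injectable so
// the behaviour can be tested without waiting for real time to pass.
function cached(scrape, ttlMs, now = Date.now) {
  let value;
  let fetchedAt = -Infinity; // forces a scrape on the first call

  return function get() {
    if (now() - fetchedAt >= ttlMs) {
      const fresh = scrape();
      value = fresh;
      fetchedAt = now();
      // If scrape() returned a promise that later rejects, expire the
      // entry so the next call retries instead of reusing the failure.
      Promise.resolve(fresh).catch(() => {
        if (value === fresh) fetchedAt = -Infinity;
      });
    }
    return value;
  };
}
```

A web handler would then call the wrapped function instead of the scraper itself, for example const getMenu = cached(scrapeMenu, 10 * 60 * 1000), where scrapeMenu stands for your own scrape function.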

My other example, jklevents, is based on zombie cheerio. It is a lot more complex as it parses through a whole collection of pages, not just one. It also performs tasks such as geocoding to further improve the quality of data.

In my third example, f500-scraper, I had to use a combination of tools as zombie didn't quite work. The issue had something to do with the way the pages were loaded using JavaScript, so the DOM just wasn't ready when I needed to scrape it. Instead I ended up capturing the page data the good old way and applying some force to it. As it happens, it worked.
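
To give an idea of what "applying some force" to raw page data can mean, here is a deliberately crude sketch that pulls table cell text out of an HTML string. It is not f500-scraper's code: a regular expression stands in for a proper HTML parser to keep the example dependency-free, and the function name extractCells is made up. Expect it to break on markup it was not written for, such as nested tables.

```javascript
// Return the text content of every <td> in a raw HTML string.
function extractCells(html) {
  const cells = [];
  const pattern = /<td\b[^>]*>([\s\S]*?)<\/td>/gi;
  let match;
  while ((match = pattern.exec(html)) !== null) {
    const text = match[1]
      .replace(/<[^>]+>/g, '') // drop tags nested inside the cell
      .replace(/\s+/g, ' ') // collapse runs of whitespace
      .trim();
    cells.push(text);
  }
  return cells;
}
```

For anything beyond a quick one-off, a real parser with selectors is likely to be less fragile.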

lte-scraper uses cheerio and request. The implementation is fairly short and may be worth a look.

Other Considerations

When scraping, be polite. Sometimes the "targets" of scraping might actually be happy that you are doing some of the work for them. In the case of jkl-event-scraper I contacted the rights holder of the data and we agreed on an attribution deal. As a result, it is alright to use the data commercially as long as it is attributed.

This is just a point I wanted to make, as there are times when good things can come out of this sort of thing. In the best case you might even earn a client this way.

Conclusion

Node.js is an amazing platform for scraping. The tooling is mature enough, and you get to use a familiar query syntax. It does not get much better than that, for me at least. I believe it could be interesting to try to apply fuzzier approaches (think AI) to scraping.

For instance, in the case of restaurant data this might lead to a more generic scraper that you could then apply to many pages containing that type of data. After all, there is a certain structure to it, although the way it has been structured in the DOM will always vary somewhat.

Even the crude methods briefly described here are often quite enough. But you can definitely make scraping a more interesting problem if you want to.