Spider by Josch13 (CC0)
To deal with this issue in a recent project, I ended up writing a little framework that makes scrapers easier to write. spyder provides structure for both writing and running your scrapers. It relies on two basic concepts: indexing and scraping.
In the first pass it executes an indexer, whose purpose is to extract the links to scrape. In the second pass, scraping, it makes sure each link gets scraped and then helps you deal with the results. You can also control the delay between individual scrapes to avoid hitting the server too hard. I'm not sure how much that helps in practice, but I guess it doesn't hurt.
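To make the two-pass flow concrete, here is a minimal sketch of the idea in TypeScript. This is not spyder's actual API; `run`, `Indexer`, and `Scraper` are hypothetical stand-ins for the callbacks you would supply.

```typescript
// A sketch of the index-then-scrape idea, not spyder's real API.
type Indexer = () => Promise<string[]>;           // pass 1: collect links
type Scraper = (url: string) => Promise<unknown>; // pass 2: scrape one link

const sleep = (ms: number) =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function run(indexer: Indexer, scraper: Scraper, delayMs = 1000) {
  // First pass: the indexer extracts the links to scrape.
  const links = await indexer();

  // Second pass: scrape each link sequentially, pausing between
  // requests so the target server isn't hit too hard.
  const results: unknown[] = [];
  for (const url of links) {
    results.push(await scraper(url));
    await sleep(delayMs);
  }

  return results;
}
```

Keeping the passes separate means the link-extraction logic stays independent of the per-page scraping logic, so either side can change without touching the other.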
Even though I have used the tool for web scraping, that doesn't mean you cannot apply it elsewhere. I can imagine it being useful for extracting insights out of a filesystem. For instance, you could set it up to run against a directory of images and analyze each one.
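As a sketch of that idea, an "indexer" over a filesystem could simply yield file paths instead of URLs, which the scraping pass would then analyze. This assumes Node.js; `indexImages` is a hypothetical helper, not part of spyder.

```typescript
import { readdir } from "node:fs/promises";
import { extname, join } from "node:path";

// Hypothetical filesystem indexer: instead of extracting links from a
// page, it "indexes" image paths under a directory. The same kind of
// scraping pass could then run an analysis step on each path.
async function indexImages(dir: string): Promise<string[]> {
  const entries = await readdir(dir);
  return entries
    .filter((name) =>
      [".jpg", ".jpeg", ".png", ".gif"].includes(extname(name).toLowerCase())
    )
    .map((name) => join(dir, name));
}
```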
If you are into indexing and scraping, check it out. I'm open to improvement requests.