Sunday, October 17, 2010

hacks

This series of messages introduces my current hacks that automate web sites crawling and data extraction from HTML pages. The current output of these scripts is a bunch of CSV files that can be further processed … in Excel. I wish I would output RDF instead of CSV. So there remains much room for further improvement (see RDF Web Scraper for a similar but approach). Anyway… Here is part One : how to crawl complex web sites with Python ?. The next part will deal with data extraction from the retrieved web pages, involving much HTML cleansing and parsing.

No comments:

Post a Comment