This series of posts introduces my current hacks for automating web site crawling and data extraction from HTML pages. For now, these scripts output a bunch of CSV files that can be further processed … in Excel. I wish they
output RDF instead of CSV, so there remains much room for improvement (see
RDF Web Scraper for a similar approach). Anyway… here is
part one: how to crawl complex web sites with Python.
The next part will deal with data extraction from the retrieved web pages, involving much HTML cleansing and parsing.