Sunday, October 17, 2010
non restful
The second source of difficulties comes from non-RESTful sites. For example, the APEC site (a French Monster-like job site) is based on a proprietary web framework, which means you cannot rely on link URLs to automate your browsing session. It took me some time to understand that, once logged in, every time you click on a link you are presented with a new frameset referring to the URLs that contain the data you are looking for. And these URLs seem to depend on your session. No permalinks, if you prefer. This makes the crawling process even trickier. To deal with this when you write your crawling script, you have to open both your favorite text editor (to write the script) and your favorite web browser (Firefox, of course!).
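As a rough illustration of the frameset problem described above, here is a minimal sketch using only the Python standard library. It pulls the frame URLs out of a frameset page and resolves them against the page address, so a script can follow them within the same session. The names FrameCollector, frame_urls and make_session_opener, the sample markup, and the example.com URLs are all hypothetical; the real APEC pages may be structured differently.

```python
from html.parser import HTMLParser
from http.cookiejar import CookieJar
from urllib.parse import urljoin
from urllib.request import HTTPCookieProcessor, build_opener


class FrameCollector(HTMLParser):
    """Collect the src attribute of every <frame> and <iframe> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in ("frame", "iframe"):
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def frame_urls(html, base_url):
    """Return the absolute URLs of the frames declared in a frameset page."""
    collector = FrameCollector()
    collector.feed(html)
    return [urljoin(base_url, src) for src in collector.sources]


def make_session_opener():
    """Build an opener that stores cookies, so a login survives across requests."""
    return build_opener(HTTPCookieProcessor(CookieJar()))


# Hypothetical frameset, similar in shape to what a session-bound site might return.
SAMPLE = """
<frameset cols="20%,80%">
  <frame name="menu" src="/menu.jsp?sid=abc123">
  <frame name="content" src="offers/list.jsp?sid=abc123">
</frameset>
"""

print(frame_urls(SAMPLE, "http://example.com/app/start.jsp"))
```

Since the frame URLs change with each session, the script has to re-read them from every frameset it receives instead of storing them, and fetch them through the same cookie-aware opener that performed the login.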
hacks
This series of messages introduces my current hacks that automate web site crawling and data extraction from HTML pages. The current output of these scripts is a bunch of CSV files that can be further processed … in Excel. I wish I could output RDF instead of CSV, so there remains much room for further improvement (see RDF Web Scraper for a similar approach). Anyway… Here is part one: how to crawl complex web sites with Python. The next part will deal with data extraction from the retrieved web pages, involving much HTML cleansing and parsing.
web scraping
Example one: I am looking for my next job. So I subscribe to many job sites in order to receive email notifications of new job ads (Monster, for example). But I’d rather check these in my RSS aggregator than in my mailbox, or in some sort of aggregating Web platform. That way, I would be able to filter, sort, rank and compare these numerous job ads as I navigate through them.
Monday, October 4, 2010
12:19:20
Python UnicodeEncodeError: 'ascii' codec can't encode character
If you've ever gotten this error, Django's smart_str function might be able to help. I found this in James Bennett's article, Unicode in the real world. He provides a very good explanation of Python's Unicode strings and bytestrings, their use in Django, and how to use Django's Unicode utilities with Python libraries that are not Unicode-friendly. Here are my notes from his article as it applies to the above error. Much of the wording is directly from James Bennett's article. This error occurs when you pass a Unicode string containing non-English characters (Unicode characters outside the 128-character ASCII range) to something that expects an ASCII bytestring. The default encoding for a Python bytestring is ASCII, "which handles exactly 128 (English) characters". This is why trying to convert Unicode characters outside that range produces the error.
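The error described above can be reproduced in a few lines. This is a minimal sketch written for Python 3, where the encoding step is explicit; in the Python 2 of the original notes the same failure typically appears implicitly, for example when str() is called on a Unicode string. The sample word is an arbitrary choice. The last line shows that encoding the same text as UTF-8 succeeds.

```python
text = "Montr\u00e9al"  # contains é (code point 233), which is outside ASCII

try:
    text.encode("ascii")  # ASCII only covers code points 0 to 127
    ascii_failed = False
except UnicodeEncodeError as exc:
    ascii_failed = True
    print(exc)

# UTF-8 can represent every Unicode character, so this call does not raise.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)
```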
The good news is that you can encode Python bytestrings in other encodings besides ASCII. Django's smart_str function, in the django.utils.encoding module, converts a Unicode string to a bytestring using a default encoding of UTF-8.
12:14:15
The design of this module is loosely based on Java’s threading model. However, where Java makes locks and condition variables basic behavior of every object, they are separate objects in Python. Python’s Thread class supports a subset of the behavior of Java’s Thread class; currently, there are no priorities, no thread groups, and threads cannot be destroyed, stopped, suspended, resumed, or interrupted. The static methods of Java’s Thread class, when implemented, are mapped to module-level functions.
12:05:35 post
New test post. Corrected some bugs in the code, and added a line to print the total number of active threads. Let's see how it comes out...
- threading.activeCount()
- Return the number of Thread objects currently alive. The returned count is equal to the length of the list returned by enumerate().
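A minimal sketch of printing the number of active threads, as mentioned in the post above. It uses threading.active_count(), the current spelling of activeCount (the camelCase name is deprecated since Python 3.10). The helper count_while_running and its use of an Event to keep the workers alive are assumptions made for illustration, not the code from the original post.

```python
import threading


def count_while_running(n_workers=3):
    """Start n_workers blocked threads and return (before, during, enumerated)."""
    release = threading.Event()
    before = threading.active_count()

    # Each worker simply waits on the event, so it stays alive while we count.
    workers = [threading.Thread(target=release.wait) for _ in range(n_workers)]
    for worker in workers:
        worker.start()

    during = threading.active_count()
    enumerated = len(threading.enumerate())

    # Let the workers finish and wait for them before returning.
    release.set()
    for worker in workers:
        worker.join()
    return before, during, enumerated


before, during, enumerated = count_while_running()
print("threads started:", during - before)
print("active_count matches enumerate:", during == enumerated)
```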
11:01:20 django-celery
Had celery for lunch. django-celery is quite tasty; a bit of ghettoq and it goes in quite well. Let's wait till it gets processed...
Websites often need tasks that run periodically, behind the scenes. Examples include sending email reminders, aggregating denormalized data and permanently deleting archived records. Very often the simplest solution is to set up a cron job to hit a URL on the site that performs the task.
Cron has the advantage of simplicity, but it's not ideal for the job. You have to take steps to ensure that regular users of the site cannot hit those URLs directly. It also forces you to manage an external configuration. What if you forget to perform the configuration on the QA or production servers? It would be safer and easier if the configuration were in the code for the site.
For Django sites, celery seems to be the solution of choice. Celery is really focused on being a distributed task queue, but it can also be a great scheduler. Their documentation is excellent, but I found that they lack a quickstart guide for using Django and celery just to replace cron.
Sunday, October 3, 2010
Saturday, October 2, 2010