Tuesday, July 6, 2010

Python: A nasty problem with encoding specification, or how to read HTML text properly.

I was wondering why my own blog site, "Doc Ernie's adventure", was not processed properly by my Python program, which should create something like this:

"Doc Ernie's adventure"

The top line of my multireportgen program contains the following encoding declaration:
# -*- coding: utf-8 -*-

Yet the Python program could not read the string properly. Viewing the page source of the Topblogs page shows that our blog title is "Doc Ernie&#039;s adventure". Note the escaped apostrophe: an ampersand, a sharp sign, then the code 039.
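As a side note, the escaped apostrophe can be turned back into a plain one with the standard library. This is only a minimal sketch, assuming Python 3 (where the html module provides unescape); the name raw_title is a sample string for illustration and is not taken from multireportgen.

```python
import html

# Sample title as it appears in the page source (hypothetical variable name).
raw_title = "Doc Ernie&#039;s adventure"

# html.unescape converts numeric and named character references,
# such as &#039; or &amp;, back into the characters they stand for.
title = html.unescape(raw_title)
print(title)
```

The unescaped string can then be written into the report like any other text.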
Then I went back to the Python documentation today and found other encoding values such as latin-1 and iso-8859-15. When I replaced utf-8 with iso-8859-15, the report was output properly, with the link to the URL of the blog shown!
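To see why the declared encoding matters, here is a small sketch, again assuming Python 3. The sample text and the helper decode_with are made up for illustration. It only shows that the same bytes read differently depending on the codec, and it does not reproduce what multireportgen actually does.

```python
# Sample text containing characters outside ASCII (hypothetical data).
text = "caf\u00e9 5\u20ac"  # an e with acute accent and a euro sign

# The bytes as they would be stored in a file saved as ISO-8859-15.
raw = text.encode("iso-8859-15")


def decode_with(data, encoding):
    """Decode bytes with the given codec, reporting a failure instead of raising."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "decode failed: " + exc.reason


for name in ("utf-8", "latin-1", "iso-8859-15"):
    print(name, "->", decode_with(raw, name))
```

Only the codec that matches how the bytes were saved gives back the original text. utf-8 rejects the bytes, and latin-1 decodes them without error but shows a different character where the euro sign should be.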

Incidentally I saw this message from Emacs:


Selected encoding utf-8-unix disagrees with iso-latin-9-unix specified by file contents. Really save (else edit coding cookies and try again)? (yes or no)


So it was just a matter of encoding specification! Visit the corrected Topblogs July 6 report.
