WTF-8 problems

Written by Alcides Fonseca at 2007/06/18

As you might have noticed, I’ve been revamping my website over the last few days.



One of the major problems was dealing with the usual stuff developers (including me) hate, which Sérgio likes to call wtf-8: encoding issues.



We have a DreamHost account, and the Python we run there defaults to ASCII. All the website output was ISO-8859-1, but the content from the feeds (parsed with feedparser, btw) was UTF-8. After a few tries, I finally managed to get it right.
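To make the mismatch concrete, here is a minimal sketch of what happens when UTF-8 bytes are read as ISO-8859-1. It is written in Python 3 syntax (unlike the Python 2 snippets in this post), and the `show_as_latin1` helper and the sample name are only illustrations, not code from the site:

```python
def show_as_latin1(text):
    """Encode text as UTF-8, then misread those bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


# "S\u00e9rgio" is "Sérgio"; the sample string is an assumption for illustration.
garbled = show_as_latin1("S\u00e9rgio")

# The single character "é" is two bytes in UTF-8 (0xC3 0xA9), so a page
# declared as ISO-8859-1 shows it as two characters: "Ã" followed by "©".
```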



First I tried chardet to detect the encoding, and then used the decode and encode methods like this:




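import chardet

# Process the page output line by line (Python 2)
for line in page.output().split("\n"):
    try:
        # Guess the encoding of this line, then re-encode it as UTF-8
        it = chardet.detect(line)
        print line.decode(it['encoding']).encode('utf-8')
    except (UnicodeError, LookupError, TypeError):
        # Detection or conversion failed; fall back to the raw line
        print line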


But this turned out not to be a good solution: when reading the feed I had to convert it to a usable encoding, and I couldn’t encode a string that was already UTF-8 into UTF-8 again. So I looked for a better and easier solution and, guess what, I found it. When reading the UTF-8 feed, I just had to use:




title = str(entry['title'].encode('ascii', 'xmlcharrefreplace'))


This will convert any character that can’t be represented in ASCII into its XML character reference. Problem solved.
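A minimal sketch of that behaviour, in Python 3 syntax (there `encode` returns `bytes`, so a final `decode('ascii')` is needed to get text back). The `to_ascii_html` helper name and the sample title are hypothetical:

```python
def to_ascii_html(text):
    """Return text as pure ASCII, with XML character references for the rest."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# "Caf\u00e9 S\u00e9rgio" is "Café Sérgio"; the sample title is an assumption.
title = to_ascii_html("Caf\u00e9 S\u00e9rgio")
# title == "Caf&#233; S&#233;rgio"
```

Because the result is plain ASCII, it renders the same whether the page is served as ISO-8859-1 or UTF-8. Note that this only replaces characters outside ASCII; it does not escape `<`, `>` or `&`.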



One thing I also found, and that I’m looking forward to adding to Pungi, is an HTML Tidy wrapper for Python: uTidylib or mxTidy. However, I couldn’t get either of them to work on DreamHost hosting yet. If you have, feel free to help me :)