
strange Unicode decoding behaviour

When I start the script from the console, I get a warning and the word comparison goes wrong:

06:52 ~/rssDj/rssReader $ python rssReader.py
begin
rssReader.py:253: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  if not (word in filterWords):

When I run the same script from Eclipse, with the same interpreter (python2.7), there are no UnicodeWarnings and all the words are checked correctly. Why is that?

Found out what it was, and there's nothing strange: the wrong encoding was being used when reading the words from the file. There needs to be something like "unicode(word, errors='replace')" or similar...

This solved it:

import codecs
f = codecs.open('unicode.rst', encoding='utf-8')
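
For anyone hitting the same warning, here is a minimal sketch of that fix. It is not the actual rssReader.py code: the names read_words and filter_words and the sample data are made up for illustration. In Python 2.7, comparing a byte string that contains non-ASCII bytes with a unicode object gives this UnicodeWarning and the two are treated as unequal. Reading the file through codecs.open with an explicit encoding means every word is already decoded text by the time it is compared.

```python
import codecs
import os
import tempfile


def read_words(path, encoding="utf-8"):
    """Return the whitespace-separated words in `path` as decoded text."""
    with codecs.open(path, encoding=encoding) as f:
        return f.read().split()


# Hypothetical filter list, written as Unicode text (not byte strings).
filter_words = [u"caf\xe9", u"na\xefve"]

# Write a small UTF-8 sample file so the sketch is self-contained.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with codecs.open(path, "w", encoding="utf-8") as f:
        f.write(u"caf\xe9 tea\n")

    words = read_words(path)
    # Both sides of the comparison are now Unicode text, so the
    # membership test behaves as expected and raises no warning.
    kept = [word for word in words if word not in filter_words]
finally:
    os.remove(path)
```

Here kept ends up holding only the word that is not in the filter list. The same code runs on Python 3, where all strings are already Unicode text.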

Yes, Unicode is a tricky little beast. If you want to use non-ASCII text anywhere, then you pretty much have to deal with Unicode properly everywhere, or you'll create a major headache for yourself down the line.
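
One common way to do that is to decode bytes as soon as they enter the program, work only with Unicode text inside it, and encode again on the way out. A rough sketch, assuming the data is UTF-8; the helper names to_text and to_bytes are hypothetical:

```python
def to_text(raw, encoding="utf-8"):
    """Decode a byte string to text; pass text through unchanged.

    Undecodable bytes become U+FFFD instead of raising an error.
    """
    if isinstance(raw, bytes):
        return raw.decode(encoding, "replace")
    return raw


def to_bytes(text, encoding="utf-8"):
    """Encode text back to bytes at the output boundary."""
    return text.encode(encoding)


raw_in = b"caf\xc3\xa9"      # UTF-8 bytes as they might arrive from a file
text = to_text(raw_in)       # decode once, at the edge
shouted = text.upper()       # all processing happens on Unicode text
raw_out = to_bytes(shouted)  # encode once, on the way out

# A Latin-1 byte is not valid UTF-8 here, so it is replaced, not fatal.
damaged = to_text(b"caf\xe9 tea")
```

The "replace" error handler is the same idea as the errors='replace' suggestion earlier in the thread: it hides bad input rather than fixing it, so it is best kept for cases where losing a character is acceptable.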

Of course, if you're writing code for other people to use (e.g. open source) then you should assume you'll be dealing with non-ASCII, but this is a habit so foreign to many software engineers that it's often not considered properly in the design.

The whole charset implementation is hell if you aren't from the US.

@Cartroo: presumably Python 3.3 is generally easier for people in this respect, and therefore recommended?

Jim

It definitely looks like it will be better; the problem (and the reason why we supported 2.7 web apps first) is that so many libraries haven't been ported over yet. But for anyone writing something new from the ground up without dependencies I'd definitely recommend 3.3.

Now we just need to get it to work in web apps...