Forums

Mysql select as utf-8

Hey,

I'm trying to select something from one of my database tables, but I want it to be selected as UTF-8 I read that you could do...

select * from table character set utf8;

This doesn't seem to be doing it, do anyone know of how you do this? I don't want to convert it after I have the data, hope that makes sense.

FYI, when inserting the data, I encoded it as UTF-8 so I want to select it already decoded.

Is this what you are talking about?

But If you are using python3 I think you will always have to specify how you want to decode the byte string.

I'd rather not alter the database, is this the only way? python27

If the column is a MySQL string type, then I think it's likely that you have garbage in the database at the moment since MySQL would have taken your utf-8 bytestream and encoded it as whatever the encoding for the column was. Unless you were inserting into a binary column, then you'd have to do the decoding yourself.

Hey Glenn, Thanks for your reply, I'm afraid this is where my database knowledge is stained a little, by garbage do you mean encoding? In my DB where the issue is, it's in there like so £60, is that classed as garbage? I didn't actually declare any character set on creation.

Table script as follows...

import MySQLdb

db = MySQLdb.connect("","", "", "")

cursor = db.cursor()

cursor.execute("DROP TABLE IF EXISTS links")

# Create table as per requirement
sql = """CREATE TABLE links (
     id INT NOT NULL AUTO_INCREMENT,
     project_id INT,
     cost TEXT, # column with £60
     archived INT,
     PRIMARY KEY (id))"""
cursor.execute(sql)
db.commit()
db.close()

To be a little more specific in my app I just add .decode('utf-8') within my template view (Using Flask/Jinja2) but I have also added a CSV download feature but it displays as \xc2\xa360 in that particular column (with the £ sign).

FYI I'm running .encode('utf-8') on the way in.

Any further guidance is much appreciated. :)

By garbage, I mean that you probably have a utf-8 encoded bytestream encoded again as some other encoding. So you would end up with things like £60 (there is actually a name for this: mojibake). From a quick test, it looks like the default for a text field on PythonAnywhere is latin1, so what you're looking at is a utf-8 encoded string reencoded as latin1.

There is a Python library for fixing this sort of thing. It's called ftfy I would suggest creating a new column in your table that has a utf-8 characterset and then copying the entries in cost to it using ftfy to fix the errors.

Joel Spolsky wrote an excellent article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) It's that article that got me from being hopelessly confused about unicode and character encodings to being able to at least work out what I'd done wrong.

On a completely non-unicode tangent: Why are you storing a cost as a text field? Wouldn't it make more sense as an integer field (perhaps with a different field for specifying the currency, if that changes)?

adding this link to our help page on mysql encoding, utf8, and character sets for anyone else that comes across this page!