Forums

MySQL utf-8

Hi,

I'm trying to insert some text from Twitter in a MySQL database, but I'm having problems with the emojis.

According to this, I would need to use the 'utf8mb4' character set instead of 'utf8'. The problem is that 'utf8mb4' is only available from MySQL 5.5.

So my question is: is there any way to solve this problem with MySQL 5.1?

Hm. I can imagine a hacky woraround where you store bytestring representations in the database rather than the bytes themselves? a comma-separated list of ints, which you could then convert back and forth from python strings?

In [2]: tweet
Out[2]: 'an alien! \xf0\x9f\x91\xbd'

In [3]: [ord(c) for c in tweet.decode('utf8')]
Out[3]: [97, 110, 32, 97, 108, 105, 101, 110, 33, 32, 128125]

In [7]: in_mysql = ','.join(str(ord(c)) for c in tweet.decode('utf8'))

In [8]: in_mysql
Out[8]: '97,110,32,97,108,105,101,110,33,32,128125'

Converting back to a string might be a bit harder. Probably easier in python 3. Which Python version are you using?

In the meantime, I'll upvote our ticket to get mysql upgraded to 5.5...

aha. 'unichr' is what we want in python 2, or just 'chr' in 3:

In [23]: ''.join(unichr(int(c)) for c in in_mysql.split(','))
Out[23]: u'an alien! \U0001f47d'

Thank you! I'm going to try that.

(I'm using python 3)

ok! things to be careful about -- it's likely to be a bit slow, and you won't be able to (easily) do any filters or lookups based on that table...

another option would be to just throw out any characters you can't store, or replace them with question-marks?

I would rather keep them. I'm gathering data on Japanese onomatopoeia, and emojis are part of the emotional content of the tweet...

I wasn't planning on doing filters on that column, so it should be ok for now :)

Thank you very much!

adding this link to our help page on mysql encoding, utf8, and character sets for anyone else that comes across this page!