[MySQL] Unicode chars outside BMP silently TRUNCATED, which is HIDDEN BY MEMCACHED

[MySQL] Unicode chars outside BMP silently TRUNCATED, which is HIDDEN BY MEMCACHED

Issue ID:2833
Issue Category:bug
Component:core
Priority:critical
Status:active
Assigned:brion
Version:0.9
Milestone:1.0
Keywords:bmp, characters, unicode, utf-8

http://identi.ca/notice/56240470 contains the following text:

Updates

#1

All the code got hosed. here's the text: http://pastebin.ubuntu.com/513930/

#2

MySQL 5.x's utf8 support truncates any unicode strings at the first non-BMP character, which I assume is the problem, since dents on identi.ca work fine for a while until they fall out of the cache.

#3

Title:Unicode chars outside BMP silently dropped» [MySQL] Unicode chars outside BMP silently dropped
Milestone:» 1.0
Version:» 0.9

MySQL 5.5 adds full Unicode support in the alternate "utf8m4" charset for UTF-8:

http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html

and also adds a "utf16" charset to supplement the BMP-only "ucs2":

http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf16.html

In 6.0, the "utf8m4" form will be renamed to regular "utf8" ("utf8m3" will be the new name of the old 3-byte UTF-8 charset):

http://dev.mysql.md/doc/refman/6.0/en/charset-unicode-sets.html

Looks like our live servers are running 5.1 still, so until we upgrade and convert the fields we won't be able to properly support these chars.

Note that using utf8m4 will probably increase the size of indexes on text/varchar columns from 3 to 4 bytes per character; we may need to watch out for too-long indexes on some tables.

I'll make sure the new schema stuff going into 1.0.x supports the utf8mb4 mode when available, so it should be a clean transition when we do upgrade.

(PostgreSQL's UTF-8 support handles up to 4-byte chars at least as of the 8.3 docs I'm peeking at, so we shouldn't need to jump through any hoops there; just for MySQL.)

#4

Title:[MySQL] Unicode chars outside BMP silently dropped» [MySQL] Unicode chars outside BMP silently TRUNCATED, which is HIDDEN BY MEMCACHED
Priority:normal» critical

Added scare caps to the issue title; dropping individual chars is fairly harmless but truncating entire messages is.... not so hot. It could also cause display errors due to unclosed tags after the truncated portion of the message.

Note that per http://status.net/open-source/issues/3002 the truncation can actually be hidden by the way we preemptively cache database objects into memcached -- just like we don't get default values from the DB scheme this way, we also don't see the actually-saved truncated versions of the text etc. But, anything that pulls direct from the DB, or anything after the 24-hour memcache expiry, will show up the truncated version, leading to confusion about whether things worked.

We should either adjust MySQL's error handling to actually *reject* inserts instead of truncating, or do some filtering in there somewhere.

#5

Getting MySQL to avoid silently truncating means switching into 'strict mode', which is apparently kind of a pain and we may have other issues running under it still.

I'll hack up a quick function to strip the chars from content prior to saving in Notice and Message, and that should take care of the most important cases. (Things like bios are also affected but don't save HTML versions, so are less explody if they get truncated.)

#6

Just had this happen with http://identi.ca/notice/93198620 which ended with U+1F44D and a U+263A.
The U+263A ☺ should at least have stayed if #5 is implemented?

However, the post was passed straight through to Twitter unharmed ‐ https://twitter.com/johndrinkwater/status/196711964532154368

Login or Register to modify this issue, or to receive updates by email.

You can also subscribe to the RSS feed for updates to this issue.