最近遇到这样一个问题,MySQL 执行报错:
查询出错 (1366): Incorrect string value: ‘\xF0\x9F\x98\x84\xF0\x9F…’ for column ‘nickName’ at row 1
执行的 SQL 语句是:
insert into user (nickName) values ('😄😄😄');
user 表中 nickName 字段的定义是:
`nickName` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL COMMENT '用户昵称'
照理说 utf8mb4 编码存储表情符号是没有问题的,于是参照 Stack Overflow 的回答把 vachar(255) 改成 varbinary(255),的确能够执行成功了。
但是问题接踵而来:查询的时候需要不区分大小写,然而 varbinary 类型不支持 case insenstive。
折腾了一番之后再回来看这个问题,一开始就不是 varchar 和 varbinary 的问题,而是 MySQL 连接的编码设置错了。
'db' => [
'class' => 'yii\db\Connection',
'dsn' => 'mysql:host=127.0.0.1;dbname=xxx',
'username' => 'xxx',
'password' => 'xxx',
'charset' => 'utf8',
],
把 MySQL 连接的 charset 从 utf8 改成 utf8mb4 才是正确姿势。
通过这个问题,暴露了我对 UTF8 编码的了解还不足,所以在这里做下补充。
什么是 Unicode 和 UTF-8?
Unicode is a worldwide character-encoding standard.
Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.
UTF-8 is a variable widthcharacter encoding capable of encoding all 1,112,064[1] valid code points in Unicode using one to four 8-bit bytes.[2] The encoding is defined by the Unicode standard, and was originally designed by Ken Thompson and Rob Pike.[3][4] The name is derived from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit.[5]
MySQL 中 utf8 和 utf8mb4 的区别。
utf8mb4
: A UTF-8 encoding of the Unicode character set using one to four bytes per character.utf8mb3
: A UTF-8 encoding of the Unicode character set using one to three bytes per character.utf8
: An alias forutf8mb3
.ucs2
: The UCS-2 encoding of the Unicode character set using two bytes per character.utf16
: The UTF-16 encoding for the Unicode character set using two or four bytes per character. Likeucs2
but with an extension for supplementary characters.utf16le
: The UTF-16LE encoding for the Unicode character set. Likeutf16
but little-endian rather than big-endian.utf32
: The UTF-32 encoding for the Unicode character set using four bytes per character.
MySQL 8.0 中默认的 utf8mb4_0900_ai_ci 字符集是什么意思?
uft8mb4
means that each character is stored as a maximum of 4 bytes in the UTF-8 encoding scheme.0900
refers to the Unicode Collation Algorithm version. (The Unicode Collation Algorithm is the method used to compare two Unicode strings that conforms to the requirements of the Unicode Standard).ai
refers accent insensitivity. That is, there is no difference between e, è, é, ê and ë when sorting.ci
refers to case insensitivity. This is, there is no difference between p and P when sorting.
参考:
765 total views, 1 views today