Please Use utf8mb4

最近遇到这样一个问题,MySQL 执行报错:

查询出错 (1366): Incorrect string value: ‘\xF0\x9F\x98\x84\xF0\x9F…’ for column ‘nickName’ at row 1

执行的 SQL 语句是:

insert into user (nickName) values ('😄😄😄');

user 表中 nickName 字段的定义是:

`nickName` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL COMMENT '用户昵称'

照理说 utf8mb4 编码存储表情符号是没有问题的,于是参照 Stack Overflow 的回答把 vachar(255) 改成 varbinary(255),的确能够执行成功了。

但是问题接踵而来:查询的时候需要不区分大小写,然而 varbinary 类型不支持 case insenstive。

折腾了一番之后再回来看这个问题,一开始就不是 varchar 和 varbinary 的问题,而是 MySQL 连接的编码设置错了。

'db' => [
    'class' => 'yii\db\Connection',
    'dsn' => 'mysql:host=127.0.0.1;dbname=xxx',
    'username' => 'xxx',
    'password' => 'xxx',
    'charset' => 'utf8',
],

把 MySQL 连接的 charset 从 utf8 改成 utf8mb4 才是正确姿势。


通过这个问题,暴露了我对 UTF8 编码的了解还不足,所以在这里做下补充。

什么是 Unicode 和 UTF-8?

Unicode is a worldwide character-encoding standard. 

Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.

UTF-8 is a variable widthcharacter encoding capable of encoding all 1,112,064[1] valid code points in Unicode using one to four 8-bit bytes.[2] The encoding is defined by the Unicode standard, and was originally designed by Ken Thompson and Rob Pike.[3][4] The name is derived from Unicode (or Universal Coded Character SetTransformation Format – 8-bit.[5]

MySQL 中 utf8 和 utf8mb4 的区别。

  • utf8mb4: A UTF-8 encoding of the Unicode character set using one to four bytes per character.
  • utf8mb3: A UTF-8 encoding of the Unicode character set using one to three bytes per character.
  • utf8: An alias for utf8mb3.
  • ucs2: The UCS-2 encoding of the Unicode character set using two bytes per character.
  • utf16: The UTF-16 encoding for the Unicode character set using two or four bytes per character. Like ucs2 but with an extension for supplementary characters.
  • utf16le: The UTF-16LE encoding for the Unicode character set. Like utf16 but little-endian rather than big-endian.
  • utf32: The UTF-32 encoding for the Unicode character set using four bytes per character.

MySQL 8.0 中默认的 utf8mb4_0900_ai_ci 字符集是什么意思?

  • uft8mb4 means that each character is stored as a maximum of 4 bytes in the UTF-8 encoding scheme.
  • 0900 refers to the Unicode Collation Algorithm version. (The Unicode Collation Algorithm is the method used to compare two Unicode strings that conforms to the requirements of the Unicode Standard).
  • ai refers accent insensitivity. That is, there is no difference between e, è, é, ê and ë when sorting.
  • ci refers to case insensitivity. This is, there is no difference between p and P when sorting.

参考:

https://docs.microsoft.com/en-us/windows/desktop/Intl/unicode
http://www.unicode.org/standard/WhatIsUnicode.html
https://en.m.wikipedia.org/wiki/UTF-8
https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets.html
https://www.monolune.com/what-is-the-utf8mb4_0900_ai_ci-collation/

 561 total views,  2 views today