2006-07-02

MySQL 4.1中涉及字符编码的变量

有多个有关character set 和 collation的变量会影响MySQL客户端与MySQL服务器的交互操作,对于在数据库中使用非英语的语言时,这些变量如果没有正确设置,就会导致出现乱码。

MySQL服务器端的character set 和 collation是由变量character_set_server 和 collation_server分别来决定的,创建数据库时默认的character set 和 collation则是由变量character_set_database and collation_database来指定的。

MySQL客户端也有一些与server连接相关的变量。
character_set_client: 服务器端根据这个变量来决定客户短发送的SQL的character set。

character_set_connection and collation_connection: 服务器端读取character_set_connection变量的值把客户短发送过来的SQL statements从character_set_client 转换成 character_set_connection,而collation_connection变量则对于比较字符串有影响。

character_set_results: 这个变量给出了服务端返回查询结果给客户端时,返回结果(结果集的metadata,列的值,列名等)所用的character set。

通常服务器端对这些变量都会有一个默认的设置,客户端可以根据自己需要调整一些变量的设置,如:
SET NAMES 'charset_name'
SET CHARACTER SET 'charset_name'

SET NAMES 语句给出了客户端发送statements所用的character set,并且制定了服务器端再返回结果是应该使用的character set。如果执行 SET NAMES 'cp1251' 就表明“随后从客户端送来的信息都是character set cp1251,返回给客户端的结果也是用character set cp1251“。

SET NAMES 等价于下面三条语句:
SET character_set_client = x;
SET character_set_results = x;
SET character_set_connection = x;

如果character_set_connection的值为x,那么collation_connection的值也就相应的被设置为x。要特别指定collation的值,使用语句:
SET NAMES 'charset_name' COLLATE 'collation_name'

SET CHARACTER SET语句与SET NAMES语句相似,但是把连接(connection)的character set 和collation设置为默认数据库的character set 和collation。SET CHARACTER SET语句等价于:
SET character_set_client = x;
SET character_set_results = x;
SET collation_connection = @@collation_database;

设置collation_connection的值同时也设定character_set_connection的值为相应的chracter set。

当一个客户端连接时,客户端告诉服务器它想使用的character set的名字,服务器根据这个名字来设定character_set_client, character_set_results, and character_set_connection变量的值,也就是服务器端实际上是执行了一个SET NAMES操作。
Example: Suppose that column1 is defined as CHAR(5) CHARACTER SET latin2. If you do not say SET NAMES or SET CHARACTER SET, then for SELECT column1 FROM t, the server sends back all the values for column1 using the character set that the client specified when it connected. On the other hand, if you say SET NAMES 'latin1' or SET CHARACTER SET latin1 before issuing the SELECT statement, the server converts the latin2 values to latin1 just before sending results back. Conversion may be lossy if there are characters that are not in both character sets.

If you do not want the server to perform any conversion of result sets, set character_set_results to NULL:

SET character_set_results = NULL;

Note: Currently, UCS-2 cannot be used as a client character set, which means that SET NAMES 'ucs2' does not work.

To see the values of the character set and collation system variables that apply to your connection, use these statements:

SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';

2006.07.05:
今天才在网上发现,MySQL官方网站已经提供5.1版本的中文手册(preview)供网上查阅,暂时还不提供文档单独下载。

3 条评论:

  1. Welcome to the MySQL monitor. Commands end with ; or \g.
    Your MySQL connection id is 185463 to server version: 4.1.13

    Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

    mysql> show Variables like "%character%";
    +--------------------------+----------------------------+
    | Variable_name | Value |
    +--------------------------+----------------------------+
    | character_set_client | latin1 |
    | character_set_connection | latin1 |
    | character_set_database | latin1 |
    | character_set_results | latin1 |
    | character_set_server | latin1 |
    | character_set_system | utf8 |
    | character_sets_dir | /usr/share/mysql/charsets/ |
    +--------------------------+----------------------------+
    7 rows in set (0.00 sec)

    回复删除
  2. mysql> show Variables like "%collation%";
    +----------------------+-------------------+
    | Variable_name | Value |
    +----------------------+-------------------+
    | collation_connection | latin1_swedish_ci |
    | collation_database | latin1_swedish_ci |
    | collation_server | latin1_swedish_ci |
    +----------------------+-------------------+
    3 rows in set (0.00 sec)

    回复删除
  3. 注意区分character set 和 character encoding的区别:
    摘自:Elliotte Rusty Harold's Effective XML

    Character Set versus Character Encoding

    XML is based on the Unicode character set. A character set is a collection of characters assigned to particular numbers called code points. Currently Unicode 4.0 defines more than 90,000 individual characters. Each character in the set is mapped to a number, such as 64, 812, or 87,000. These numbers are not ints, shorts, bytes, longs, or any other numeric data type. They are simply numbers. Other character sets, such as Shift-JIS and Latin-1, contain different collections of characters that are assigned to different numbers, although there's often substantial overlap with the Uni code character set. That is, many character sets assign some or all of their characters to the same numbers to which Unicode assigns those characters.

    A character encoding represents the members of a character set as bytes in a particular way. There are multiple encodings of Unicode, including UTF-8, UTF-16, UCS-2, UCS-4, UTF-32, and several other more obscure ones. Different encodings may encode the same code point using a different sequence of bytes and/or a different number of bytes. They may use big-endian or little-endian data. They can even use non-twos complement representations. They may use two bytes or four bytes for each character. They may even use different numbers of bytes for different characters.

    Changing the character set changes which characters can be represented. For instance, the ISO-8859-7 set includes Greek letters. The ISO-8859-1 set does not. Changing the character encoding does not change which characters can be used—it merely changes how each character is encoded in bytes.

    XML parsers always convert characters in other sets to Unicode before reporting them to the client application. In effect, they treat other character sets as different encodings of some subset of Unicode. Thus, XML doesn't ever really let you change the character set. This is always Unicode. XML only lets you adjust how those characters are represented.

    回复删除