
UTF-8 and Unicode Standards



What is UTF-8?

UTF-8 stands for Unicode Transformation Format-8. It is a lossless, octet-oriented (8-bit) encoding of Unicode characters.

UTF-8 encodes each Unicode character as a sequence of one to four octets, where the number of octets depends on the code point (the integer value) assigned to the character. It is an efficient encoding for Unicode documents that consist mostly of US-ASCII characters, because it represents each character in the range U+0000 through U+007F as a single octet. UTF-8 is also the encoding an XML processor assumes when a document has neither an encoding declaration nor a byte order mark.
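As a rough illustration of the length rule, the sketch below maps a code point to its octet count using the ranges given in RFC 3629 and checks the result against Python's built-in encoder. The function name utf8_length and the sample characters are assumptions made for this example; they do not come from any standard.

```python
def utf8_length(code_point: int) -> int:
    """Return how many octets UTF-8 needs for one code point (ranges per RFC 3629)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    if code_point <= 0x7F:      # US-ASCII range
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:    # rest of the Basic Multilingual Plane
        return 3
    return 4                    # supplementary planes

# Hypothetical samples: "A", e with acute accent, euro sign, musical G clef.
samples = ["A", "\u00e9", "\u20ac", "\U0001D11E"]

for ch in samples:
    encoded = ch.encode("utf-8")
    # The built-in encoder should agree with the range table above.
    assert len(encoded) == utf8_length(ord(ch))
    # Decoding returns the original character, so nothing is lost.
    assert encoded.decode("utf-8") == ch
    print(f"U+{ord(ch):04X} takes {len(encoded)} octet(s): {encoded.hex()}")
```

The US-ASCII sample encodes to the same single octet it has in ASCII, which is why mostly-ASCII documents stay compact in UTF-8.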

Standards

RFC 3629: UTF-8, a transformation format of ISO 10646. November 2003.
The Unicode Standard 5.0, November 2006. [purchase from Amazon.com]
In particular, see the informal description of UTF-8 in sections 2.5 and 2.6, pages 30-32, and a much more formal definition in sections 3.9 and 3.10, pages 77-81.
Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard

Articles and background reading

UTF-8 and Unicode FAQ for Unix/Linux by Markus Kuhn
Forms of Unicode, an excellent overview by Mark Davis
Wikipedia: UTF-8, which contains a good discussion of why five- and six-octet sequences are now illegal in UTF-8
Unicode Transformation Formats [czyborra.com]
Unicode UTF-8 FAQ
Unicode in XML and other Markup Languages: Unicode Technical Report #20
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), an amusing and informative article by Joel Spolsky

Character Sets

The MIME charset parameter value for UTF-8 is UTF-8. Charset names are case-insensitive, so utf-8 is equally valid [IANA Character Sets].
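The case-insensitivity can be checked with a small sketch. It uses Python's standard email.message module to read the charset parameter out of a Content-Type value; the helper name charset_of is an assumption for this example. Python's get_content_charset() lower-cases the value it returns, which makes the two spellings directly comparable.

```python
from email.message import Message

def charset_of(content_type: str):
    """Return the lower-cased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()

# Both spellings name the same character set.
upper = charset_of("text/html; charset=UTF-8")
lower = charset_of("text/plain; charset=utf-8")
assert upper == lower == "utf-8"

# A Content-Type without a charset parameter yields None.
assert charset_of("text/html") is None
```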

In an HTML file, place this tag inside <head> ... </head>:

<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">

In an XML document, the encoding is specified in the XML declaration at the start of the prolog:

<?xml version="1.0" encoding="UTF-8" ?>
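To see the declaration in use, here is a minimal sketch with Python's standard xml.etree.ElementTree parser. The greeting element and its text are made up for the example. The document is encoded to UTF-8 octets first, and the parser relies on the declaration to decode them.

```python
import xml.etree.ElementTree as ET

# A hypothetical document containing one non-ASCII character (e with acute accent).
text = '<?xml version="1.0" encoding="UTF-8" ?>\n<greeting>caf\u00e9</greeting>'

# Serialize to octets, as the document would be stored or sent over the network.
data = text.encode("utf-8")

# The parser reads the encoding from the declaration and decodes the octets.
root = ET.fromstring(data)
assert root.tag == "greeting"
assert root.text == "caf\u00e9"
```

Passing bytes rather than a str matters here: the declared encoding only comes into play when the parser has to decode octets itself.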

In the Apache server configuration or an .htaccess file, this directive adds charset=UTF-8 to the Content-Type HTTP header of text/html and text/plain responses:

AddDefaultCharset UTF-8

Last modified: Sun Sep 14 14:27:06 PDT 2008