https://utf8.com/

Title: UTF-8 and Unicode Standards

Fetched At: November 18, 2025

# UTF-8 and Unicode

UTF-8 (**U**nicode **T**ransformation **F**ormat, **8**-bit) is a variable-width encoding that can represent every character in the Unicode character set. It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks found in UTF-16 and UTF-32. \[1\]
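The two design goals can be illustrated with a minimal Python sketch. It assumes Python's built-in codecs and uses arbitrary sample strings: pure ASCII text produces the same bytes in ASCII and UTF-8, while UTF-16 has two byte orders for the same character.

```python
# ASCII compatibility: pure ASCII text encodes to identical bytes in both encodings.
ascii_text = "Hello"  # arbitrary sample string
ascii_bytes = ascii_text.encode("ascii")
utf8_bytes = ascii_text.encode("utf-8")
print(ascii_bytes == utf8_bytes)

# Byte order: UTF-16 has a big-endian and a little-endian form of the same character.
utf16_be = "A".encode("utf-16-be")
utf16_le = "A".encode("utf-16-le")
print(utf16_be, utf16_le)

# UTF-8 is defined as a sequence of octets, so there is only one form.
utf8_a = "A".encode("utf-8")
print(utf8_a)
```

Because UTF-8 is defined octet by octet, a reader never has to guess the byte order, which is the role a byte order mark plays in UTF-16 and UTF-32.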


UTF-8 encodes each Unicode character as a sequence of 1 to 4 octets, where the number of octets depends on the integer value (the code point) assigned to the character. It is an efficient encoding for Unicode documents that consist mostly of US-ASCII characters, because it represents each character in the range U+0000 through U+007F as a single octet.
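As a rough illustration of how the octet count follows from the code point, the sketch below uses the range boundaries from RFC 3629. The helper name `utf8_length` and the sample characters are chosen here for illustration only.

```python
def utf8_length(code_point: int) -> int:
    """Return how many octets UTF-8 needs for a code point (ranges per RFC 3629)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not encoded in UTF-8")
    if code_point <= 0x7F:
        return 1  # U+0000..U+007F (US-ASCII)
    if code_point <= 0x7FF:
        return 2  # U+0080..U+07FF
    if code_point <= 0xFFFF:
        return 3  # U+0800..U+FFFF
    return 4      # U+10000..U+10FFFF


# Cross-check the helper against Python's own encoder on a few sample characters.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    encoded = ch.encode("utf-8")
    assert len(encoded) == utf8_length(ord(ch))
    print(f"U+{ord(ch):04X} -> {len(encoded)} octet(s): {encoded.hex()}")
```

The loop compares the helper with the built-in encoder, so a mismatch in any range would fail the assertion instead of printing a wrong count.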

UTF-8 is the default encoding for XML, and since 2010 it has been the dominant character encoding on the Web.

## Standards

- RFC 3629: UTF-8, a transformation format of ISO 10646. November 2003.
- The Unicode Standard 5.0, November 2006.
  - In particular, see the informal description of UTF-8 in sections 2.5 and 2.6, pages 30-32, and a much more formal definition in sections 3.9 and 3.10, pages 77-81.

## Articles and background reading

- UTF-8 and Unicode FAQ for Unix/Linux by Markus Kuhn
- Forms of Unicode, an excellent overview by Mark Davis
- Wikipedia's UTF-8 article contains a good discussion of why five- and six-octet sequences are now illegal in UTF-8
- Unicode Transformation Formats \[czyborra.com\]
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), an amusing and informative article by Joel Spolsky
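The point about five- and six-octet sequences can be checked directly. The sketch below assumes Python's built-in strict UTF-8 decoder as the validator; the helper name `is_valid_utf8` and the byte strings are illustrative choices, not taken from any of the articles above.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 under Python's strict decoder."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


four_octets = b"\xf0\x9f\x98\x80"         # U+1F600, a well-formed four-octet sequence
five_octets = b"\xf8\x88\x80\x80\x80"      # starts with the old five-octet lead byte 0xF8
six_octets = b"\xfc\x84\x80\x80\x80\x80"   # starts with the old six-octet lead byte 0xFC
beyond_max = b"\xf4\x90\x80\x80"           # four octets, but would decode past U+10FFFF

for label, data in [
    ("four octets", four_octets),
    ("five octets", five_octets),
    ("six octets", six_octets),
    ("beyond U+10FFFF", beyond_max),
]:
    print(f"{label}: {is_valid_utf8(data)}")
```

Only the first sequence is accepted, since RFC 3629 restricts UTF-8 to the range U+0000 through U+10FFFF, which never needs more than four octets.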

## Character Sets

The MIME character set attribute for UTF-8 is `UTF-8`. Character set names are case-insensitive, so `utf-8` is equally valid. \[IANA Character Sets\]
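Many libraries follow the same case-insensitive convention. As a small sketch, Python's codec registry (its own lookup table, not the IANA registry itself) resolves differently cased spellings to one codec:

```python
import codecs

# Differently cased spellings of the same character set name.
names = ["UTF-8", "utf-8", "Utf-8"]

# codecs.lookup() normalizes the name before searching the registry.
canonical = {codecs.lookup(name).name for name in names}
print(canonical)
```

All three spellings resolve to a single codec, so the set contains exactly one entry.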

In a modern HTML5 page, place this tag inside `<head>` ... `</head>`:

`<meta charset="utf-8">`

In an XML prolog, the encoding is typically specified as an attribute:

`<?xml version="1.0" encoding="UTF-8" ?>`
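To see the declaration in use, here is a minimal sketch that parses a UTF-8 encoded document with Python's standard `xml.etree.ElementTree` module. The `greeting` element and its text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A tiny document with the prolog shown above; the element name is hypothetical.
document = '<?xml version="1.0" encoding="UTF-8" ?><greeting>caf\u00e9</greeting>'

# Serialize to bytes using the encoding the prolog declares.
raw = document.encode("utf-8")

# The parser reads the declaration and decodes the octets accordingly.
root = ET.fromstring(raw)
print(root.tag, root.text)
```

The declared encoding has to match the octets actually written; declaring `UTF-8` while saving the file in another encoding is a common source of parse errors.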

Last modified: Fri Jun 14 11:47:29 PDT 2024