Nikita the Spider

Encoding Divination

divine (verb) -- 1. To know by inspiration, intuition, or reflection. 2. To guess.

Some of you may remember the days of computing past when seven-bit ASCII and green-on-black terminals were good enough for everyone. They had to be good enough, because that was all that was available. (Of course, back in my day we only had one bit to work with, and it was always zero, and our terminals were black-on-black, but you didn't hear us complaining about it, not like kids these days, etc. etc.) Thankfully the seven-bit ASCII limitation is long gone, and we live in the bright, shiny Unicode present.

The Eighth Bit...and Beyond!

That brings up the question of how all of those non-ASCII characters should be represented bytewise, which is where encodings come in. A document fetched via the Web falls under the purview of both the HTTP specification and the specification of the language in which the document is written (usually HTML, XHTML or XML). Not only is there more than one spec involved, but each offers at least one unique way to decide what a document's encoding is. It's no surprise, then, that it isn't a simple process. It involves a little inspiration, intuition, reflection, and, yes, guesswork.

The flowchart below describes how Nikita divines the encoding of a Web document. You can click on any white shape (the diamonds and boxes) to learn more about that step. First, some terminology: what I call an encoding is also called a character set. Some of the specifications that I quote use the latter term. Also, when I say the HTTP spec I am referring to both HTTP 1.0 (RFC 1945) and HTTP 1.1 (RFC 2616). They don't differ on the subject of encodings, so I can safely lump them together.

[Flowchart: decision steps covering the HTTP charset header, the Byte Order Mark, the document's media type, META elements containing an http-equiv directive, the encoding family, and the XML encoding in the file's first line.]
This flowchart describes how Nikita divines a document's encoding.
This flowchart is also available as a PDF in both US Letter and A4 format.

I created the flowchart and PDFs with a terrific program called OmniGraffle.
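
If you prefer code to diagrams, here is a rough Python sketch of my reading of the flowchart's decision order. Every helper it calls (charset_from_headers, sniff_bom and friends) is a hypothetical placeholder rather than Nikita's actual code; rough versions of several of them appear as sketches in the sections below.

# Hypothetical sketch of the flowchart's decision order. The helpers
# charset_from_headers, sniff_bom, charset_from_meta, encoding_family,
# decode_first_line and xml_declared_encoding are placeholder names.
def divine_encoding(headers, body, media_type):
    # 1. An explicit charset in the HTTP Content-Type header wins.
    charset = charset_from_headers(headers)
    if charset:
        return charset
    # 2. Next best is a Byte Order Mark at the very start of the file.
    bom_encoding = sniff_bom(body)
    if bom_encoding:
        return bom_encoding
    if media_type == "text/html":
        # 3a. HTML: look for a META element with an http-equiv directive,
        # falling back to HTTP's default for text/* subtypes.
        return charset_from_meta(body) or "iso-8859-1"
    # 3b. XML: guess the encoding family from the "<?xml" bytes, decode the
    # first line with a member of that family, and read the XML declaration.
    family = encoding_family(body)
    first_line = decode_first_line(body, family)
    charset = xml_declared_encoding(first_line)
    if charset:
        return charset
    # text/xml defaults to US-ASCII (see the note at the end);
    # otherwise XML's own default of UTF-8 applies.
    return "us-ascii" if media_type == "text/xml" else "utf-8"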

Media Type

The HTTP Content-Type header defines both the media type and the charset (encoding). A typical Content-Type header looks like this:

Content-Type: text/html; charset=UTF-8

The "charset" portion is optional and is often not present.

Encoding/Charset

As described above, the charset can be specified in the HTTP Content-Type header. It is optional there, and the HTTP 1.1 spec is very clear about what to do when the charset is missing: When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP. Data in character sets other than "ISO-8859-1" or its subsets MUST be labeled with an appropriate charset value. The HTTP 1.0 spec says the same.

The HTML 4.01 specification passes harsh judgement on this portion of the HTTP spec. It says, The HTTP protocol ([RFC2616], section 3.7.1) mentions ISO-8859-1 as a default character encoding when the "charset" parameter is absent from the "Content-Type" header field. In practice, this recommendation has proved useless because some servers don't allow a "charset" parameter to be sent, and others may not be configured to send the parameter. Therefore, user agents must not assume any default value for the "charset" parameter. (emphasis mine)

In practice, when the encoding isn't specified most browsers either guess (using their own encoding divination tricks) or default to Windows-1252, which is a superset of ISO-8859-1. Since Nikita's goal is to help you to comply with published specs, she sticks with the HTTP-specified default of ISO-8859-1.

BOM (Byte Order Mark)

A BOM is a control character placed at the absolute beginning of a file that can give user agents a clue as to the document's encoding. The BOM is cleverly defined to be inoffensive to most software that doesn't make a special effort to handle it, but some programs will still complain or get confused when reading files with a BOM.

One advantage of using BOMs is that they're still informative even when the file is opened in a non-HTTP context (for instance, by a text editor).

The HTML 4.01 specification mentions BOMs for UTF-16-encoded documents in section 5.2.1. But it ignores them in section 5.2.2 where it lists a user agent's priorities for determining the character encoding. In fact, I can't find any spec that says whether or not a BOM trumps a charset specified in a META http-equiv statement, so my description of the BOM as the second-most authoritative source for the encoding is driven by practical issues and not by a spec.
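
Checking for a BOM is straightforward. Here is an illustrative sketch (mine, not Nikita's code) that compares the first bytes of a document against the well-known BOM sequences, checking the longer UTF-32 marks first so they aren't mistaken for UTF-16 ones:

import codecs

# Longer BOMs first, so the UTF-32 marks aren't mistaken for UTF-16 ones.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8,     "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def sniff_bom(data):
    # Return the encoding implied by a leading BOM, or None if there isn't one.
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding
    return None

print(sniff_bom(b"\xef\xbb\xbf<html>"))   # utf-8
print(sniff_bom(b"<html>"))               # None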

Content-Type specified in a META http-equiv

The HTML spec makes provisions for specifying HTTP headers inside HTML documents via a META http-equiv element. In practice, most user agents will read a document's encoding from a META http-equiv even though support for reading other faux headers specified via META http-equiv is weak. Here's an example of this technique:

<meta http-equiv="content-type" content="text/html; charset=UTF-8">

The tricky part of reading the encoding from the document is that the user agent has to know the document's encoding in order to read it. This seeming Catch-22 is mitigated by the fact that the most commonly used encodings are in the ASCII family (see the discussion of encoding families below), so user agents can decode the file piece by piece with ISO-8859-1 or something similar and hope for the best. Once the user agent finds the META http-equiv, it can stop guessing and decode the file using the correct encoding. This implies that specifying a non-ASCII-based encoding (such as UTF-16 or EBCDIC) via this technique is risky, because it relies on the user agent being able to guess the correct encoding in order to parse the document at all.
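
In code, the usual trick is to decode the raw bytes with an ASCII-compatible encoding purely in order to hunt for the META element. The sketch below is my illustration of the idea, not Nikita's implementation, and its regular expression is deliberately naive:

import re

_META_RE = re.compile(
    r'<meta\s+http-equiv=["\']?content-type["\']?\s+'
    r'content=["\']?[^"\'>]*charset=([\w-]+)',
    re.IGNORECASE)

def charset_from_meta(data):
    # Decode permissively with an ASCII-compatible encoding just to find the tag.
    text = data.decode("iso-8859-1")
    match = _META_RE.search(text)
    return match.group(1) if match else None

html = (b'<html><head><meta http-equiv="content-type" '
        b'content="text/html; charset=UTF-8"></head></html>')
print(charset_from_meta(html))   # UTF-8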

Encoding families

An encoding family is a group of encodings which map ASCII characters to the same values using the same per-character width (e.g. single byte versus double byte). The most familiar encoding family for most of us is the ASCII family, which includes US-ASCII, UTF-8, ISO-8859-1, ISO-8859-2...ISO-8859-15, Win-1252, KOI-8, Shift-JIS, and more. All of these encodings represent the ASCII character set with bytes 0x00 - 0x7f. Another family is one I'll call UTF-16BE after its most popular member. According to the W3C, big-endian ISO-10646-UCS-2 is also a member of this family. A third family is EBCDIC, which has a number of different flavors.

All told, there are eight different encoding families, which I further categorize into four superfamilies.

XML files are bound by rules that make it possible to (sometimes) divine the encoding by yet another method that makes use of these encoding families. XML files that contain a declaration must start with the characters <?xml. Because these five characters have a different representation in each of the encoding families described above, they serve as a low-quality BOM. A user agent that wants to parse such a file looks for bytes that represent "<?xml" in each possible representation. Once found, this gives enough of a clue to process the first line of the XML file which may contain an exact encoding declaration.

Appendix F of the XML spec describes the low-quality BOM. Its official title, Autodetection of Character Encodings, conveys much more confidence than the name of its anchor: "sec-guessing".
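
To make the autodetection trick concrete, here is a sketch of mine (not Nikita's code, and it ignores the UCS-4 byte orders) that hunts for the bytes of "<?xml" as written in a few of the families:

# How "<?xml" looks at the start of a file in three encoding families.
_XML_SIGNATURES = [
    ("ascii family",     "<?xml".encode("us-ascii")),   # also UTF-8, ISO-8859-*, ...
    ("utf-16-be family", "<?xml".encode("utf-16-be")),  # 00 3C 00 3F 00 78 ...
    ("utf-16-le family", "<?xml".encode("utf-16-le")),  # 3C 00 3F 00 78 00 ...
    ("ebcdic family",    "<?xml".encode("cp037")),      # one EBCDIC flavour
]

def encoding_family(data):
    # Guess the encoding family of an XML document from its first bytes.
    for family, signature in _XML_SIGNATURES:
        if data.startswith(signature):
            return family
    return None

print(encoding_family(b"<?xml version='1.0'?>"))   # ascii family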

The XML encoding

The encoding of an XML document is often declared at the beginning of the file, as defined in the XML specification's section on the text declaration.
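
Once the first line can be decoded, extracting the declared encoding is a simple pattern match. A minimal sketch (again mine, not Nikita's):

import re

_XML_DECL_RE = re.compile(r'encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')

def xml_declared_encoding(first_line):
    # Return the encoding named in an XML declaration, or None.
    if not first_line.lstrip().startswith("<?xml"):
        return None
    match = _XML_DECL_RE.search(first_line)
    return match.group(1) if match else None

print(xml_declared_encoding('<?xml version="1.0" encoding="ISO-8859-2"?>'))  # ISO-8859-2
print(xml_declared_encoding('<?xml version="1.0"?>'))                        # None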

Note: the fact that text/xml documents default to US-ASCII surprises a lot of people (including me), but RFC 3023 (XML Media Types) is very clear about it.