Nikita the Spider

About Validation Mode

HTML and XHTML look pretty similar from the point of view of a Web page author, but they have roots in different markup languages -- SGML and XML respectively. SGML and XML have different rules and are therefore validated differently. (W3.org hosts a very detailed description of the differences between SGML and XML, written by James Clark. Mr. Clark is also the author of OpenSP, the validation software at the heart of both Nikita and the W3C Validator.)

Since SGML and XML are different, it's important that a validator like Nikita uses the correct validation mode for your documents. On the modern Web, it's not always easy to decide which mode a document calls for, because the text/html media type (sent in the HTTP Content-Type header) is used to deliver both HTML (SGML) and XHTML (XML).

There are many arguments against sending XHTML labelled as text/html, but they're beyond the scope of this article. Instead of taking a bold stand on the handling of XML sent as text/html, Nikita simply echoes the behavior of the W3C Validator when choosing a validation mode, since that's the behavior people have learned to expect from a validator.

There are three factors that can influence which validation mode Nikita uses -- the media type (sometimes also called MIME type and content type), the doctype, and whether or not the document contains an XML declaration. The algorithm that Nikita uses looks like this:

/* Default mode is SGML. */
ValidationMode = SGML
if the media type is XMLish [1]
   ValidationMode = XML
else
   /* implied ==> media type is text/html */
   Sniff the doctype
   if doctype is known [2]
      ValidationMode = mode implied by the doctype [3]
   else
      /* implied ==> doctype is not present or is unknown */
      if an XML declaration is present
         ValidationMode = XML

Footnotes for the code above:

  1. "XMLish" media types include application/xhtml+xml and text/xml.
  2. Nikita's "known" doctypes are limited to a specific list of formal public identifiers, but it's rare to find a Web page that uses a formally recognized doctype that falls outside of this list. If a page has no doctype at all, that counts as "unknown".
  3. The validation mode implied by the doctype is pretty straightforward. HTML doctypes (e.g. HTML 4.01 Strict, HTML 4.01 Transitional, HTML 3.2) imply SGML mode; XHTML doctypes (e.g. XHTML 1.1, XHTML 1.0 Strict) imply XML mode.
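
The algorithm above can be sketched in Python as follows. This is only an illustration, not Nikita's actual code: the names Mode, XMLISH_MEDIA_TYPES, KNOWN_DOCTYPES and choose_validation_mode are made up for this example, the set of XMLish media types is assumed to hold just the two named in footnote 1, and the doctype list is a small sample of formal public identifiers rather than Nikita's real list.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    SGML = "SGML"
    XML = "XML"


# Assumption: only the two media types named in footnote 1 count as XMLish.
XMLISH_MEDIA_TYPES = {"application/xhtml+xml", "text/xml"}

# Assumption: an illustrative sample of formal public identifiers,
# not Nikita's actual list of known doctypes.
KNOWN_DOCTYPES = {
    "-//W3C//DTD HTML 3.2 Final//EN": Mode.SGML,
    "-//W3C//DTD HTML 4.01//EN": Mode.SGML,
    "-//W3C//DTD HTML 4.01 Transitional//EN": Mode.SGML,
    "-//W3C//DTD XHTML 1.0 Strict//EN": Mode.XML,
    "-//W3C//DTD XHTML 1.0 Transitional//EN": Mode.XML,
    "-//W3C//DTD XHTML 1.1//EN": Mode.XML,
}


def choose_validation_mode(media_type: str,
                           doctype_fpi: Optional[str],
                           has_xml_declaration: bool) -> Mode:
    """Return the validation mode for one document.

    doctype_fpi is the doctype's formal public identifier, or None
    when the document has no doctype at all.
    """
    if media_type.strip().lower() in XMLISH_MEDIA_TYPES:
        return Mode.XML
    # From here on, the media type is taken to be text/html.
    if doctype_fpi in KNOWN_DOCTYPES:
        return KNOWN_DOCTYPES[doctype_fpi]
    # Doctype is missing or unknown: only an XML declaration selects XML.
    if has_xml_declaration:
        return Mode.XML
    return Mode.SGML  # the default mode
```

With this sketch, a text/html document that has no doctype but does carry an XML declaration gets Mode.XML, while the same document without the declaration falls back to Mode.SGML.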

This table summarizes the validation mode that the algorithm above will choose given all possible combinations of media type, doctype and XML declaration. The numbers in the first column correspond to the numbers of Nikita's validation mode test pages.

Test  Media Type  Doctype  XML Decl Present  Validation Mode  Warnings Issued by Nikita
01    text/html   HTML     Yes               SGML             None
02    text/html   HTML     No                SGML             None
03    text/html   XHTML    Yes               XML              None, or media type/doctype conflict (1)
04    text/html   XHTML    No                XML              None, or media type/doctype conflict (1)
05    text/html   Unknown  Yes               XML              Missing doctype
06    text/html   Unknown  No                SGML             Missing doctype
07    XMLish      HTML     Yes               XML              Media type/doctype conflict
08    XMLish      HTML     No                XML              Media type/doctype conflict
09    XMLish      XHTML    Yes               XML              None
10    XMLish      XHTML    No                XML              None
11    XMLish      Unknown  Yes               XML              Missing doctype
12    XMLish      Unknown  No                XML              Missing doctype

Footnotes for the table above:

  1. For documents sent with a media type of text/html and an XHTML 1.0 doctype, Nikita won't issue a warning unless you specifically ask her to do so. XHTML 1.1 documents sent as text/html always generate a warning.
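
The warnings column can be expressed as a small function too. Again, this is a sketch under assumptions rather than Nikita's implementation: the function name expected_warning, the doctype labels, and the warn_on_xhtml10_as_html flag (standing in for "specifically asking" for the XHTML 1.0 warning) are all hypothetical. The XML declaration doesn't appear as a parameter because, according to the table, it never changes which warning is issued.

```python
from typing import Optional

CONFLICT = "Media type/doctype conflict"
MISSING = "Missing doctype"


def expected_warning(xmlish_media_type: bool,
                     doctype: Optional[str],
                     warn_on_xhtml10_as_html: bool = False) -> Optional[str]:
    """Return the warning text from the table, or None for no warning.

    doctype is "HTML", "XHTML 1.0" or "XHTML 1.1" (hypothetical labels),
    or None when the doctype is missing or unknown.
    """
    if doctype is None:
        return MISSING  # tests 05, 06, 11 and 12
    if xmlish_media_type:
        # An HTML doctype conflicts with an XML media type (tests 07, 08).
        return CONFLICT if doctype == "HTML" else None
    # The media type is text/html from here on.
    if doctype == "XHTML 1.1":
        return CONFLICT  # always warned about, per the table footnote
    if doctype == "XHTML 1.0" and warn_on_xhtml10_as_html:
        return CONFLICT  # only on request, per the table footnote
    return None
```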