Nikita the Spider

By The Numbers – Spring 2008

Introduction

Statistics about Web page quality – like the frequency of certain validation error messages and the popularity of XHTML versus HTML doctypes – are pretty esoteric stuff, and likely not of great interest to most Webmasters. But they are of interest to the people who build the tools that build the Web, like the fine folks behind the W3C Validator.

If you're one of those people, or you're just interested in Web site quality, then you might be interested in this statistical overview of things that Nikita examines about each page she validates.

The data for each topic below are summarized as a table and, where appropriate, a pie chart. The data are an aggregate of the sites that Nikita has seen; no particular site is singled out.

Methods

In the spring of 2008 I sampled the data generated by Nikita's most recent crawls. The data exclude duplicate crawls of the same site and all sites where Nikita saw fewer than 30 pages. This left 360 crawls of unique sites. From each of these, the program selected 30 pages at random for a corpus of 10800 pages. (10800 = 360 × 30)
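The selection procedure can be sketched roughly as follows. This is only an illustration, not the program actually used: the `build_corpus` function, the `(site, pages)` input shape, and the rule that the first crawl listed for a site is the one that counts are all assumptions made for the example.

```python
import random

MIN_PAGES = 30       # crawls with fewer pages than this are excluded
PAGES_PER_SITE = 30  # pages drawn at random from each remaining crawl


def build_corpus(crawls, rng=None):
    """Pick PAGES_PER_SITE random pages from each qualifying site.

    crawls: iterable of (site, pages) tuples, assumed to be ordered with the
    most recent crawl first. pages is a list of page identifiers.
    Returns a dict mapping site -> list of sampled pages.
    """
    rng = rng or random.Random()
    seen = set()
    corpus = {}
    for site, pages in crawls:
        if site in seen:
            continue  # duplicate crawl of a site already considered
        seen.add(site)
        if len(pages) < MIN_PAGES:
            continue  # too few pages to sample from
        corpus[site] = rng.sample(pages, PAGES_PER_SITE)
    return corpus
```

With 360 qualifying sites, a function like this would return 360 × 30 = 10800 pages, matching the corpus size above.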

Validation Messages

There are a total of 540726 validation messages in the sample containing almost 5400 unique messages. I decided (somewhat arbitrarily) to show only the 25 most frequent messages in the table below. These 25 represent just over half (55%) of the sample.

Note that the message 'required attribute "ALT" not specified' appears twice, once with the attribute in uppercase and once in lowercase (an HTML/XHTML difference). If I were to combine these two entries, the result would be the most common error message by a large margin. Even without combining them, the uppercase variant alone is still the most frequent.

Instances Portion of Total Message
41196 7.62% required attribute "ALT" not specified
40187 7.43% reference not terminated by REFC delimiter
37662 6.97% reference to external entity in attribute value
21067 3.90% character ";" not allowed in attribute specification list
20055 3.71% an attribute value must be a literal unless it contains only name characters
16985 3.14% character "%" is not allowed in the value of attribute "WIDTH"
16658 3.08% an attribute value specification must be an attribute value literal unless SHORTTAG YES is specified
15134 2.80% required attribute "alt" not specified
9669 1.79% end tag for "br" omitted, but OMITTAG NO was specified
8014 1.48% value of attribute "ALIGN" cannot be "MIDDLE"; must be one of "LEFT", "CENTER", "RIGHT", "JUSTIFY", "CHAR"
7768 1.44% document type does not allow element "P" here; missing one of "APPLET", "OBJECT", "MAP", "IFRAME", "BUTTON" start-tag
7528 1.39% end tag for "img" omitted, but OMITTAG NO was specified
7207 1.33% non SGML character number 0
6779 1.25% required attribute "TYPE" not specified
5457 1.01% character "&" is the first character of a delimiter but occurred as data
5371 0.99% document type does not allow element "A" here
4610 0.85% end tag for "FONT" omitted, but its declaration does not permit this
4336 0.80% end tag for element "A" which is not open
4285 0.79% non SGML character number 1
3695 0.68% reference to entity "action" for which no system identifier could be generated
3686 0.68% reference to entity "widgetId" for which no system identifier could be generated
3606 0.67% document type does not allow element "DIV" here; assuming missing "LI" start-tag
3059 0.57% value of attribute "ALIGN" cannot be "ABSMIDDLE"; must be one of "TOP", "MIDDLE", "BOTTOM", "LEFT", "RIGHT"
2805 0.52% end tag for element "FONT" which is not open
2624 0.49% document type does not allow element "li" here; missing one of "ul", "ol", "menu", "dir" start-tag
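To put a number on the combined "ALT"/"alt" entry mentioned before the table, here is a small sketch that folds messages together case-insensitively. The counts are copied from the table; lowercasing the whole message is an assumption made for the example, and on the full data set it could merge messages that differ in other ways.

```python
from collections import Counter

TOTAL_MESSAGES = 540726  # total validation messages in the sample

# Three rows copied from the table above.
counts = Counter({
    'required attribute "ALT" not specified': 41196,
    'required attribute "alt" not specified': 15134,
    'reference not terminated by REFC delimiter': 40187,
})

# Fold messages that differ only in letter case into a single entry.
merged = Counter()
for message, n in counts.items():
    merged[message.lower()] += n

top_message, top_count = merged.most_common(1)[0]
share = 100 * top_count / TOTAL_MESSAGES
print(f"{top_message}: {top_count} ({share:.2f}%)")
```

The combined entry comes to 56330 instances, or about 10.4% of all messages, compared with 7.43% for the runner-up.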

Encodings

A page's encoding is sometimes also called its "character set" or "charset" for short.

In the pie chart and table below you can see that the vast majority of encoding declarations are either UTF-8 or ISO-8859-1. About half of the declarations in the sample (52%) are UTF-8 and four in ten (41%) are ISO-8859-1. The third most popular is Windows-1252 (an ISO-8859-1 superset) with a meager 2% of the total.

When no encoding is specified, Nikita usually defaults to ISO-8859-1 per HTTP rules.

This pie chart is a graphical representation of the data in the table below.

In the table below, the total number of encodings found is greater than the number of pages in the sample because many pages declare multiple encodings. (Nikita will warn you if she finds multiple encodings declared for a page on your site.)

Instances Portion of Total Encoding
7446 52.43% UTF-8 *
5814 40.94% ISO-8859-1
309 2.18% WINDOWS-1252
180 1.27% ISO-8859-15
173 1.22% WINDOWS-1251
91 0.64% WINDOWS-1250
58 0.41% WINDOWS-1257
35 0.25% ISO-8859-9
30 0.21% ISO-8859-2
29 0.20% EUC-JP
23 0.16% BIG5
7 0.05% ISO-8859-7
3 0.02% US-ASCII
3 0.02% UTF-16LE
1 0.01% WINDOWS-1502
1 0.01% WINDOWS-1253

* I counted the encodings "UTF-8" and "UTF8" as the same. For the record, the former is the overwhelmingly more popular expression with 7,423 occurrences compared to 23 for the latter.
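Treating "UTF-8" and "UTF8" as one encoding is a small normalization step. One way to sketch it is with Python's codec registry, which knows many aliases. This is an assumption about tooling, not a description of how the counts above were produced, and it would also merge aliases that the table keeps apart if they appeared. The `normalize` helper is hypothetical.

```python
import codecs
from collections import Counter


def normalize(label):
    """Map an encoding label to Python's canonical codec name.

    Labels Python does not recognize are kept, upper-cased, so that
    unusual declarations still show up in the tally.
    """
    label = label.strip()
    try:
        return codecs.lookup(label).name
    except LookupError:
        return label.upper()


# Made-up declarations, just to show the two spellings being merged.
declared = ["UTF-8", "UTF8", "utf-8", "ISO-8859-1", "WINDOWS-1502"]
tally = Counter(normalize(label) for label in declared)
```

A label the registry does not recognize, such as the WINDOWS-1502 entry in the table, is simply passed through.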

Encoding Sources

Web pages can declare their encoding in four different places: in the HTTP Content-Type header, inside the HTML in the META HTTP-EQUIV Content-Type tag, in the file's BOM, or in the XML declaration. (You might want to read about how Nikita divines a Web page's encoding from this jumble of information.)
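As a rough illustration of those four places, the sketch below collects every encoding declaration it can find for a page. It does not reproduce Nikita's actual logic: the `declared_encodings` function, the deliberately loose regular expressions, and the 4096-byte scan window are assumptions made for the example, and no precedence among the sources is implied.

```python
import codecs
import re

# UTF-32 BOMs are checked first because the UTF-32LE BOM begins with the
# same two bytes as the UTF-16LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]

# Loose patterns; a real implementation would parse more carefully.
CHARSET_RE = re.compile(r'charset\s*=\s*["\']?([\w.:-]+)', re.I)
META_RE = re.compile(r'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', re.I)
XML_DECL_RE = re.compile(r'<\?xml[^>]*encoding\s*=\s*["\']([\w.:-]+)', re.I)


def declared_encodings(content_type_header, body):
    """Return a list of (source, encoding) pairs found for one page.

    content_type_header: value of the HTTP Content-Type header, or None.
    body: the raw bytes of the page.
    """
    found = []
    m = CHARSET_RE.search(content_type_header or "")
    if m:
        found.append(("HTTP response header", m.group(1).upper()))
    for bom, name in BOMS:
        if body.startswith(bom):
            found.append(("BOM", name))
            break
    # Latin-1 maps every byte to a character, so this decode cannot fail.
    head = body[:4096].decode("latin-1")
    m = XML_DECL_RE.search(head)
    if m:
        found.append(("XML declaration", m.group(1).upper()))
    m = META_RE.search(head)
    if m:
        found.append(("META HTTP-EQUIV tag", m.group(1).upper()))
    if not found:
        found.append(("Fallback to default", "ISO-8859-1"))
    return found
```

A page can easily yield more than one pair, for instance when the server sends one charset and the META tag names another.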

In the pie chart and table below you can see that almost ⅔ of the encodings are specified via a META tag and most of the rest are declared via HTTP.

This pie chart is a graphical representation of the data in the table below.

In the table below, the number of encoding sources is greater than the number of pages in the sample because many pages declare their encoding in multiple places. (Nikita will warn you if she finds multiple encodings declared for a page on your site.)

Instances Portion of Total Source
9021 63.61% META HTTP-equiv tag
4299 30.32% HTTP response header
737 5.20% Fallback to default
123 0.87% The file's BOM (byte order mark)
1 0.01% The XML declaration

Doctypes

There are 41 distinct doctypes in the sample. For purposes of discussing them here, I considered doctypes the same if they used the same FPI (Formal Public Identifier). For example, two doctypes that share an FPI but differ elsewhere (say, in the system identifier) were considered equivalent.
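To make the grouping concrete, here is a minimal sketch that extracts the FPI from a doctype declaration and compares two doctypes on that basis. The two XHTML 1.0 Strict declarations are illustrative examples chosen for this sketch, not necessarily a pair from the sample, and the `fpi` helper and its regular expression are hypothetical.

```python
import re

# The FPI is taken to be the first quoted string after the PUBLIC keyword.
FPI_RE = re.compile(
    r'<!DOCTYPE\s+\w+\s+PUBLIC\s+(["\'])(.+?)\1',
    re.IGNORECASE,
)


def fpi(doctype):
    """Return the Formal Public Identifier of a doctype, or None."""
    m = FPI_RE.search(doctype)
    return m.group(2) if m else None


# Same FPI, with and without a system identifier (the DTD URL).
with_system_id = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
)
without_system_id = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN">'

same_doctype = fpi(with_system_id) == fpi(without_system_id)
```

Under this rule both declarations count as XHTML 1.0 Strict.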

I chose not to display doctypes that represented < 1% of the sample.

In the pie chart and table below, you can see that XHTML doctypes represent a little more than ⅔ of the sample, with HTML (of course) making up the remainder. Interestingly, transitional doctypes (both HTML and XHTML) dominate the field.

This pie chart is a graphical representation of the data in the table below.

In the table below, the number of doctypes sums to less than the number of pages because I ignored doctypes that represented < 1% of the total.

Instances Portion of Total Doctype
4123 43.93% XHTML 1.0 Transitional
1994 21.24% XHTML 1.0 Strict
1655 17.63% HTML 4.01 Transitional
533 5.68% HTML 4.01 Strict
249 2.65% HTML 4.0 Transitional
211 2.25% XHTML 1.1

Media Types

A media type is also sometimes called a "content type". Nearly all of the pages in this sample (> 99.6%) use the text/html media type; the remaining one-third of one percent use application/xhtml+xml.

Nikita's view of the world might be slightly skewed away from application/xhtml+xml for a couple of reasons. First of all, Nikita sends an Accept header of */* and many servers capable of sending application/xhtml+xml might do so only if they find that exact string in the Accept header sent by the client.

Second, Nikita doesn't attempt to masquerade as a browser via her user agent string; it is simply "Nikita the Spider" and doesn't contain words like "Mozilla", "Opera", "WebKit", "KHTML", etc. that might convince servers that she's capable of handling XHTML. It's likely that cautious servers choose to send text/html to Nikita.
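The first of these reasons can be sketched as a server-side decision. The function below is a hypothetical illustration of a cautious server, not code from any real server: it sends application/xhtml+xml only when the client names that type explicitly, and it ignores q-values and the user agent for simplicity.

```python
XHTML = "application/xhtml+xml"


def choose_media_type(accept_header):
    """Pick a media type the way a cautious, hypothetical server might.

    accept_header: the value of the client's Accept header, or None.
    """
    # Split "type;q=0.9, type2" into bare, lower-cased media types.
    offered = [
        part.split(";")[0].strip().lower()
        for part in (accept_header or "").split(",")
    ]
    if XHTML in offered:
        return XHTML
    # A wildcard such as */* is not treated as a request for XHTML.
    return "text/html"
```

Under this policy a client that sends only `*/*`, as Nikita does, receives text/html, while a client whose Accept header lists application/xhtml+xml explicitly receives XHTML.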

The overwhelming dominance of text/html makes the pie chart and table below superfluous but they're here for completeness.

This pie chart is a graphical representation of the data in the table below.

Unlike some of the other properties discussed in this article, there's a simple 1:1 relationship between pages and media types. Each page has exactly one, which is specified in the HTTP Content-Type header. (It's possible to specify something else in an HTML META HTTP-EQUIV tag, but this is non-standard and of no practical use. Nikita doesn't look for media types specified in the page contents.)

Instances Portion of Total Media Type
10765 99.68% text/html
35 0.32% application/xhtml+xml

Conclusion

The statistics above give an overview of what Nikita has seen on the sites she's crawled recently.

The most frequent validation messages are probably not a great surprise to anyone who has hand-coded HTML. It's unfortunately easy to forget the alt attribute on img tags, and the second-most common error ("reference not terminated by REFC delimiter") is a sure sign of unescaped ampersands (&, ASCII 38). Validators are excellent tools for catching these sorts of mistakes.

The frequent use of META tags to specify encodings speaks to a common inability to set encodings via HTTP headers, or an unawareness of that ability.

The most surprising observation to me is the widespread use of the transitional doctypes. At the time these data were collected in 2008, the transitional doctypes were already more than eight years old, which is about half the lifetime of the Web itself. That's quite the lengthy transition...

About Bias

I believe my sample is a sufficiently random representation of what Nikita sees, but what Nikita sees doesn't represent the Web as a whole. For one thing, Nikita has mostly been promoted in English-speaking venues. Also, it's reasonable to assume that those who ask Nikita to crawl their site are more aware of Web standards than the average Web site author. To quote How To Lie With Statistics, "The result of a sampling study is no better than the sample it is based on."