Nikita the Spider

By The Numbers – Fall 2008

Introduction

This is a followup to the Spring 2008 article of the same name. It describes some statistics of what Nikita has seen on the sites she's recently crawled.

This article touches on some things that the first didn't cover. In addition to the statistics about things like which validation messages, document type and media type are most common, you'll find out how often Webmasters send correct HTTP headers, and whether or not they specify encodings correctly.

The data for each topic below are summarized as a table and, where appropriate, a pie chart. The data are an aggregate of the sites that Nikita has seen; no particular site is singled out.

The Corpus

The data come from all of Nikita's most recent crawls. I ignored crawls that met any of the following criteria –

After discarding the crawls that met any of the criteria above, I was left with 1021 crawls of unique sites. From each of these, the program selected 25 pages at random for a corpus of 25525 pages. (25525 = 1021 × 25)
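The sampling step described above can be sketched roughly as follows. This is only an illustration, not Nikita's actual code: the `crawls` mapping, the `build_corpus` function, and the toy site names are all hypothetical, and every crawl is assumed to contain at least 25 pages.

```python
import random

PAGES_PER_SITE = 25  # from the article: 25 pages chosen at random per crawl


def build_corpus(crawls, pages_per_site=PAGES_PER_SITE, seed=None):
    """Pick pages_per_site pages at random from each crawl.

    `crawls` is a hypothetical mapping of site name -> list of page URLs.
    Each crawl is assumed to hold at least pages_per_site pages.
    """
    rng = random.Random(seed)
    corpus = []
    for site, pages in crawls.items():
        # sample() picks without replacement, so no page is chosen twice
        corpus.extend(rng.sample(pages, pages_per_site))
    return corpus


# Toy data: 3 made-up sites with 40 pages each.
crawls = {
    f"site{n}.example": [f"http://site{n}.example/page{i}" for i in range(40)]
    for n in range(3)
}
corpus = build_corpus(crawls, seed=1)
print(len(corpus))  # 3 sites x 25 pages = 75
```

With 1021 crawls instead of 3, the same arithmetic gives the 25525 pages of the sample.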

The Basics

Nikita validates pages in either HTML (SGML) or XML mode. (You can read more about the implications of the validation mode and how Nikita chooses.) In the sample, 19140 (almost exactly ¾) were "XMLish", i.e. probably some variety of XHTML. The remainder were almost all HTML.

Nikita couldn't determine the validation mode for about 0.5% of pages. This happens when the pages are empty, unreadable, etc.

Of the XMLish pages, 8315 (43%) had no validation errors, while this was true for just 1372 (22%) of the HTML pages. Overall, 9687 of the pages (38%) were free of validation errors.

Some pages didn't even pass the most fundamental tests. Nikita found 92 (.3%) with encoding errors that made them unreadable and 579 (2%) that had no title. (The latter includes pages with empty <title> elements as well as those where the tag was missing entirely.)

HTTP Headers

Nikita examines error-prone headers for problems. Particularly common are headers containing dates that don't conform to one of the formats in RFC 2616 §3.3.

Of the 7648 Expires headers sent in the sample, 964 (almost 13%) were malformed. Of these, 480 (about half) were either 0 or -1. Although RFC 2616 specifically reminds clients to watch out for 0 in the Expires header, it also clearly states that zero is invalid.

Last-Modified headers fared much better, with only 95 of 7809 (1%) containing invalid dates.
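For readers who want to check their own headers, here is a minimal sketch of a syntax check against the three date formats listed in RFC 2616 §3.3.1. It is not Nikita's implementation; the function name `is_http_date` is made up for this example, and the check is purely syntactic (it doesn't verify that the day, month and weekday agree with each other).

```python
import re

_WKDAY = r"(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun)"
_WEEKDAY = r"(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)"
_MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
_TIME = r"\d{2}:\d{2}:\d{2}"

_PATTERNS = [
    # RFC 1123 format: Sun, 06 Nov 1994 08:49:37 GMT
    re.compile(rf"{_WKDAY}, \d{{2}} {_MONTH} \d{{4}} {_TIME} GMT"),
    # RFC 850 format: Sunday, 06-Nov-94 08:49:37 GMT
    re.compile(rf"{_WEEKDAY}, \d{{2}}-{_MONTH}-\d{{2}} {_TIME} GMT"),
    # asctime() format: Sun Nov  6 08:49:37 1994 (single-digit day is space-padded)
    re.compile(rf"{_WKDAY} {_MONTH} (?:\d{{2}}| \d) {_TIME} \d{{4}}"),
]


def is_http_date(value):
    """Return True if value matches one of the three RFC 2616 date syntaxes."""
    return any(pattern.fullmatch(value) for pattern in _PATTERNS)


for candidate in ("Sun, 06 Nov 1994 08:49:37 GMT", "0", "-1"):
    print(repr(candidate), is_http_date(candidate))
```

A regular expression is used rather than `time.strptime()` because `strptime()` accepts single-digit days, which the RFC's grammar does not.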

Last-Modified and Expires, along with ETag and Cache-Control, are caching headers. At least one of these headers accompanied 16799 (66%) of the pages.

Validation Messages

There are a total of 739510 validation messages in the sample containing over 9100 unique messages. I decided (somewhat arbitrarily) to show only the 25 most frequent messages in the table below. These 25 represent a little more than 43% of the sample.

The most frequent message, 'reference not terminated by REFC delimiter', usually comes from an unescaped ampersand (&, ASCII 38). Unescaped ampersands are common in URLs like this one:
<a href="http://example.com/foo.php?id=42&question=">
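If the markup is generated by a program, escaping the URL before writing it into the attribute avoids this message. A small sketch using Python's standard `html.escape()` function (the variable names are just for illustration):

```python
from html import escape

url = "http://example.com/foo.php?id=42&question="

# escape() turns & into &amp; (and, with quote=True, also escapes quotes),
# which is what an attribute value needs.
tag = f'<a href="{escape(url, quote=True)}">'
print(tag)
# <a href="http://example.com/foo.php?id=42&amp;question=">
```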

Note that the message 'required attribute "ALT" not specified' appears twice, once with the attribute in uppercase and once in lowercase (an HTML/XHTML difference). If I were to combine these two entries, the combined entry would be the most common error message.

Instances Portion of Total Message
53646 7.25% reference not terminated by REFC delimiter
47952 6.48% reference to external entity in attribute value
39623 5.36% required attribute "ALT" not specified
29138 3.94% required attribute "alt" not specified
16211 2.19% end tag for "img" omitted, but OMITTAG NO was specified
14884 2.01% an attribute value specification must be an attribute value literal unless SHORTTAG YES is specified
13265 1.79% end tag for "br" omitted, but OMITTAG NO was specified
10947 1.48% character "&" is the first character of a delimiter but occurred as data
9373 1.27% non SGML character number 11
9055 1.22% end tag for "FONT" omitted, but its declaration does not permit this
7877 1.07% an attribute value must be a literal unless it contains only name characters
7543 1.02% document type does not allow element "p" here; missing one of "object", "applet", "map", "iframe", "button", "ins", "del" start-tag
7195 0.97% document type does not allow element "P" here; missing one of "APPLET", "OBJECT", "MAP", "IFRAME", "BUTTON" start-tag
5849 0.79% end tag for "input" omitted, but OMITTAG NO was specified
5487 0.74% document type does not allow element "li" here; missing one of "ul", "ol", "menu", "dir" start-tag
5281 0.71% element "TD" undefined
5246 0.71% required attribute "TYPE" not specified
5234 0.71% required attribute "type" not specified
4549 0.62% character "<" is the first character of a delimiter but occurred as data
4185 0.57% document type does not allow element "div" here; missing one of "object", "applet", "map", "iframe", "button", "ins", "del" start-tag
4037 0.55% document type does not allow element "LI" here
3935 0.53% character data is not allowed here
3886 0.53% element "O:P" undefined
3854 0.52% element "A" undefined
3634 0.49% element "P" undefined
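A ranking like the one above can be produced with a simple tally. The sketch below is an assumption about how one might do it, not the code used for this article; `rank_messages` and the toy message list are hypothetical. The `fold_case` option lowercases the whole message, which merges pairs such as "ALT"/"alt".

```python
from collections import Counter


def rank_messages(messages, top=25, fold_case=False):
    """Return (instances, percent of total, message) for the most common messages."""
    counts = Counter(m.lower() if fold_case else m for m in messages)
    total = sum(counts.values())
    return [(n, 100.0 * n / total, msg) for msg, n in counts.most_common(top)]


# Made-up sample: 12 messages, three distinct strings.
messages = (
    ['reference not terminated by REFC delimiter'] * 5
    + ['required attribute "ALT" not specified'] * 4
    + ['required attribute "alt" not specified'] * 3
)

for instances, portion, message in rank_messages(messages):
    print(f"{instances} {portion:.2f}% {message}")

# With case folding, the two "alt" messages are counted together
# and move to the top of the list.
print(rank_messages(messages, fold_case=True)[0][0])  # 7
```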

Encodings

A page's encoding is sometimes also called its "character set" or "charset" for short.

In the pie chart and table below you can see that the vast majority of encoding declarations are either UTF-8 or ISO-8859-1. About ⅔ of the sample (67%) declare an encoding of UTF-8 and three in ten (30%) declare ISO-8859-1. The third most popular is Windows-1252 (an ISO-8859-1 superset) with a little over 1% of the total.

When no encoding is specified, Nikita usually defaults to ISO-8859-1 per HTTP rules.

This pie chart is a graphical representation of the data in the table below.

In the table below, the total number of encodings found is greater than the number of pages in the sample because many pages declare multiple encodings. (Nikita will warn you if she finds multiple encodings declared for a page on your site.)

Instances Portion of Total Encoding
24958 67.15% UTF-8 *
11056 29.75% ISO-8859-1
539 1.45% WINDOWS-1252
216 0.58% ISO-8859-2
92 0.25% US-ASCII
81 0.22% WINDOWS-1250
51 0.14% ISO-8859-9
35 0.09% ISO-8859-15
27 0.07% WINDOWS-1251
25 0.07% ISO8859-1
25 0.07% WINDOWS-1257
25 0.07% WINDOWS-1253
24 0.06% WINDOWS-1255
8 0.02% WINDOWS-1254
4 0.01% GB2312
1 0.00% KOI8_U

* I counted the encodings "UTF-8" and "UTF8" together. There was only one occurrence of the latter.

Encoding Sources

Web pages can declare their encoding in four different places: in the HTTP Content-Type header, inside the HTML in the META HTTP-EQUIV Content-Type tag, in the file's BOM, or in the XML declaration. (You might want to read about how Nikita divines a Web page's encoding from this jumble of information.)
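To make the four sources concrete, here is a rough sketch that collects every encoding a page declares. It is not how Nikita does it; the function name `declared_encodings`, the simple regular expressions, and the decision to look only at the first 1024 bytes are all assumptions made for this example. The ISO-8859-1 fallback mirrors the default mentioned earlier in this article.

```python
import codecs
import re
from email.message import Message

# UTF-32 is checked before UTF-16 because the UTF-32 LE BOM starts
# with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Deliberately simple patterns; a real parser would be more careful.
_META_RE = re.compile(rb"""<meta[^>]+charset\s*=\s*["']?([\w.:-]+)""", re.IGNORECASE)
_XML_RE = re.compile(rb"""<\?xml[^>]+encoding\s*=\s*["']([\w.:-]+)""", re.IGNORECASE)


def declared_encodings(content_type, body):
    """Return a dict mapping each source found to the encoding it declares."""
    found = {}
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()  # lowercased, or None if absent
        if charset:
            found["http"] = charset
    for bom, name in _BOMS:
        if body.startswith(bom):
            found["bom"] = name
            break
    head = body[:1024]  # assumption: declarations appear near the top
    for source, pattern in (("meta", _META_RE), ("xml", _XML_RE)):
        match = pattern.search(head)
        if match:
            found[source] = match.group(1).decode("ascii").lower()
    # Nothing declared anywhere: fall back to the default.
    return found or {"default": "iso-8859-1"}


body = (
    b"\xef\xbb\xbf<html><head>"
    b'<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'
    b"</head></html>"
)
print(declared_encodings("text/html; charset=UTF-8", body))
```

The toy page above declares its encoding in three places, and the META tag disagrees with the other two. That is the kind of conflict that produces more encodings than pages in the counts.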

In the pie chart and table below you can see that about 60% of the encodings are specified via a META tag and most of the rest (about 36%) were declared via HTTP.

This pie chart is a graphical representation of the data in the table below.

In the table below, the number of encoding sources is greater than the number of pages in the sample because many pages declare their encoding in multiple places. (Nikita will warn you if she finds multiple encodings declared for a page on your site.)

Instances Portion of Total Source
22304 60.01% META HTTP-equiv tag
13359 35.94% HTTP response header
1093 2.94% Fallback to default
411 1.11% The file's BOM (byte order mark)

Doctypes

There are 41 distinct doctypes in the sample. For purposes of discussing them here, I considered doctypes the same if they used the same FPI (Formal Public Identifier). For example, these two doctypes were considered equivalent:

I chose not to display doctypes that represented < 1% of the sample.

In the pie chart and table below, you can see that XHTML doctypes represent more than ¾ of the sample, with HTML making up the remainder. Interestingly, transitional doctypes (both XHTML and HTML) dominate their respective fields.

This pie chart is a graphical representation of the data in the table below.

In the table below, the number of doctypes sums to less than the number of pages because I ignored doctypes that represented < 1% of the total.

Instances Portion of Total Doctype
11093 47.83% XHTML 1.0 Transitional
7390 31.86% XHTML 1.0 Strict
2517 10.85% HTML 4.01 Transitional
1099 4.74% HTML 4.01 Strict
603 2.60% XHTML 1.1 Strict
492 2.12% HTML 4.0 Transitional

Media Types

A media type is also sometimes called a "content type". Nearly all of the pages in this sample (> 99.7%) use the text/html media type; the remaining two-tenths of one percent use application/xhtml+xml.

Nikita's view of the world might be slightly skewed away from application/xhtml+xml for a couple of reasons. First of all, Nikita sends an Accept header of */* and many servers capable of sending application/xhtml+xml might do so only if they find that exact string in the Accept header sent by the client.
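The server behavior described above might look something like this sketch. It models a hypothetical cautious server, not any particular product; the function name `choose_media_type` is made up, and quality values (`q=...`) are ignored for simplicity.

```python
XHTML = "application/xhtml+xml"


def choose_media_type(accept_header):
    """Send XHTML only when the client names it explicitly in Accept."""
    # Split "type/subtype;q=0.9, ..." into bare media types.
    offered = [part.split(";")[0].strip().lower() for part in accept_header.split(",")]
    return XHTML if XHTML in offered else "text/html"


# Nikita's Accept header: the wildcard alone doesn't name XHTML explicitly.
print(choose_media_type("*/*"))  # text/html

# A browser-style header that lists application/xhtml+xml.
print(choose_media_type("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"))
# application/xhtml+xml
```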

Second, Nikita doesn't attempt to masquerade as a browser via her user agent string; it is simply "Nikita the Spider" and doesn't contain words like "Mozilla", "Opera", "WebKit", "KHTML", etc. that might convince servers that she's capable of handling XHTML. It's likely that cautious servers choose to send text/html to Nikita.

The overwhelming dominance of text/html makes the pie chart and table below superfluous but they're here for completeness.

This pie chart is a graphical representation of the data in the table below.

Unlike some of the other properties discussed in this article, there's a simple 1:1 relationship between pages and media types. Each page has exactly one which is specified in the HTTP Content-Type header. (It's possible to specify something else in an HTML META HTTP-EQUIV tag, but this is non-standard and of no practical use. Nikita doesn't look for media types specified in the page contents.)

Instances Portion of Total Media Type
25469 99.78% text/html
56 0.22% application/xhtml+xml

Page Sizes

Page size reflects the byte count of the markup sent to Nikita. It doesn't include any external files like CSS, JavaScript, images, audio files, etc.

The mean page size of the sample is 20278 bytes, the median is 13795 bytes, and the mode is 42. As Douglas Adams would be quick to point out, this answer shows the value of knowing what the question is.
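The three averages can disagree quite a bit on skewed data like page sizes. A small sketch with Python's `statistics` module shows the effect; the sizes below are invented for illustration and are not taken from the sample.

```python
import statistics

# Made-up page sizes in bytes: two tiny pages, a few typical ones, one large.
sizes = [42, 42, 9000, 13795, 20000, 35000, 60000]

mean = statistics.mean(sizes)      # pulled upward by the large page
median = statistics.median(sizes)  # the middle value
mode = statistics.mode(sizes)      # the most frequent value

print(mean, median, mode)
```

Here the mode is 42 only because two tiny pages happen to share a byte count, which says little about a typical page.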

It's very useful to visualize page size data. The graph below shows the percentage of pages ≤ the size given on the X axis. About ¼ of the pages were 8KiB or smaller, ½ were 13KiB or smaller, ¾ were 23KiB or smaller and almost 99% were smaller than 100KiB.

This image is a graphical representation of the data in the table below.

Here's a tabular look at the page sizes. The X axis of the graph is on a constant scale which facilitates interpretation but sacrifices detail for values ≤ 15k which is where a lot of the action happens. This table remedies that.

Page size (KiB) % of sizes ≤
1 1.6%
2 3.4%
3 5.5%
4 8.3%
5 11.6%
6 15.8%
7 20.4%
8 25.1%
9 30.4%
10 35.2%
11 39.6%
12 44.1%
13 48.3%
14 52.5%
15 56.0%
20 69.9%
25 78.7%
30 84.1%
35 87.9%
40 90.4%
45 92.4%
50 93.8%
60 95.8%
70 97.2%
80 97.8%
90 98.3%
100 98.8%
125 99.3%
150 99.5%
175 99.6%
200 99.7%
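Cumulative percentages like those in the table above are straightforward to compute from a list of sizes. This is a sketch under the assumption that the sizes are available as a non-empty list of byte counts; `percent_at_or_below` and the toy data are hypothetical.

```python
from bisect import bisect_right

KIB = 1024  # bytes per KiB


def percent_at_or_below(sizes_in_bytes, thresholds_kib):
    """For each threshold, return the percentage of sizes <= that many KiB."""
    ordered = sorted(sizes_in_bytes)
    total = len(ordered)
    return [
        # bisect_right counts how many sorted values are <= the threshold
        (kib, 100.0 * bisect_right(ordered, kib * KIB) / total)
        for kib in thresholds_kib
    ]


# Made-up sizes in bytes.
sizes = [512, 1024, 1500, 4096, 10000]
for kib, percent in percent_at_or_below(sizes, [1, 2, 4, 10]):
    print(f"{kib} {percent:.1f}%")
```

The same function works for the reference counts if the thresholds are passed as plain counts and `KIB` is set to 1.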

Reference Counts

References from Nikita's perspective are any references to other entities. Nikita is aggressive about counting these because she wants to check everything that could possibly be broken. Therefore these counts include not just common <a href="..."> links, but also references to images, frames, scripts, CSS files, etc.
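As a rough illustration of this kind of counting, the sketch below tallies every `href` and `src` attribute in a page using Python's standard `html.parser`. The set of attributes is an assumption made for this example; the article doesn't list exactly which attributes Nikita follows, and her real list is likely longer.

```python
from html.parser import HTMLParser

# Assumption: only href and src are treated as references here.
REFERENCE_ATTRS = {"href", "src"}


class ReferenceCounter(HTMLParser):
    """Collect (tag, value) pairs for every reference-bearing attribute."""

    def __init__(self):
        super().__init__()
        self.references = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        for name, value in attrs:
            if name in REFERENCE_ATTRS and value:
                self.references.append((tag, value))


def count_references(markup):
    parser = ReferenceCounter()
    parser.feed(markup)
    parser.close()
    return len(parser.references)


markup = (
    '<html><head><link rel="stylesheet" href="a.css">'
    '<script src="a.js"></script></head>'
    '<body><a href="/x">x</a><img src="i.png" alt=""><p>no refs</p></body></html>'
)
print(count_references(markup))  # 4: one link, one script, one anchor, one image
```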

The mean reference count of pages in the sample is 62.9, the median is 47, and the mode is 37.

As with the page sizes above, it's especially helpful to visualize these data. The graph below shows that about ¼ of the pages had 29 or fewer references, ½ had 47 or fewer, ¾ had 75 or fewer and almost 94% had fewer than 150.

This graph has a long tail (not shown); to bring the curve up to 99% would require doubling the width of the graph.

This image is a graphical representation of the data in the table below.

Here's a tabular look at the reference counts. As mentioned above, the graph doesn't show the highest values for lack of space. This table shows what the graph can't.

References % ≤
5 3.1%
10 5.2%
15 9.4%
20 14.7%
25 20.0%
30 27.1%
35 33.7%
40 41.0%
45 47.8%
50 54.5%
60 64.7%
70 72.1%
80 77.3%
90 81.4%
100 84.7%
125 90.7%
150 94.2%
175 96.2%
200 97.5%
250 98.7%
300 99.1%
500 99.8%

Conclusion

Webmasters have a bit to learn about HTTP date formats given that more than one in ten Expires headers were wrong. Since RFC 2616 states that invalid dates are to be interpreted as "already expired", a misconfigured Web server can make an entire site uncacheable.

In addition to 0 and -1 being used in the Expires header, I often see this invalid value:

   Sat, 1 Jan 2000 00:00:00 GMT

The "Full Date" section of RFC 2616 which describes valid date formats says that the day of the month must have two digits and shows an example to this effect. That's a detail that only a computer could care about, but computers (not humans) are precisely the target audience for HTTP headers.
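The simplest way to avoid this mistake is to let a library format the date. As one example (a sketch, not a recommendation for any particular server setup), Python's standard `email.utils.formatdate()` produces the two-digit day when asked for a GMT date:

```python
from email.utils import formatdate

# 946684800 is the Unix timestamp for midnight UTC on 1 January 2000.
expires = formatdate(946684800, usegmt=True)
print(expires)  # Sat, 01 Jan 2000 00:00:00 GMT
```

Note the zero-padded "01", which is the only difference from the invalid value shown above.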

The rest of these observations are essentially unchanged from the spring By The Numbers article, which means that Nikita's users haven't changed much in the past six months.

The most frequent validation messages are probably not a great surprise to anyone who has hand-coded HTML. It's unfortunately easy to forget to escape ampersands and to omit the alt attribute on img tags. Validators are excellent tools for catching these sorts of mechanical mistakes.

The frequent use of META tags to specify encodings speaks to a common inability to set encodings via HTTP headers, or an unawareness of that ability.

The most surprising observation to me is the widespread use of the transitional doctypes. At the time these data were collected in late 2008, the transitional doctypes were already more than eight years old which is about half the lifetime of the Web itself. That's quite the lengthy transition...

About Bias

I believe my sample is a sufficiently random representation of what Nikita sees, but what Nikita sees doesn't represent the Web as a whole. For one thing, Nikita has mostly been promoted in English-speaking venues. Also, it's reasonable to assume that those who ask Nikita to crawl their site are more aware of Web standards than the average Web site author. To quote How To Lie With Statistics, "The result of a sampling study is no better than the sample it is based on."