This is a followup to the Spring 2008 article of the same name. It describes some statistics of what Nikita has seen on the sites she's recently crawled.
This article touches on some things that the first didn't cover. In addition to the statistics about things like which validation messages, document type and media type are most common, you'll find out how often Webmasters send correct HTTP headers, and whether or not they specify encodings correctly.
The data for each topic below are summarized as a table and, where appropriate, a pie chart. The data are an aggregate of the sites that Nikita has seen; no particular site is singled out. You can jump directly to:
The data come from all of Nikita's most recent crawls. I ignored crawls that met any of the following criteria –
This is a change from the previous article in which the data from the last crawl of a site were used. Doing so captures the most recent information, but it may also underreport errors. Since Nikita pinpoints errors on one's site, it is only natural that Webmasters fix those errors and so Nikita finds fewer problems on subsequent crawls of a site. Using the data from the first crawl tells us more about what a site is like before a Webmaster has had the benefit of Nikita's observations.
After discarding the crawls that didn't meet the criteria above, I was left with 1021 crawls of unique sites. From each of these, the program selected 25 pages at random for a corpus of 25525 pages. (25525 = 1021 × 25)
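With the qualifying crawls in hand, the sampling step itself is simple. Here's a minimal sketch in Python — the crawl data, site names, and function name are all made up for illustration, not Nikita's actual code:

```python
import random

def sample_pages(crawls, pages_per_crawl=25, seed=1):
    """Pick a fixed-size random sample of pages from each crawl.

    `crawls` maps a site to the list of page URLs seen in that site's
    first qualifying crawl. Sampling is without replacement within a
    site, so no page is counted twice.
    """
    rng = random.Random(seed)  # seeded so the sample is reproducible
    corpus = []
    for site, pages in crawls.items():
        corpus.extend(rng.sample(pages, pages_per_crawl))
    return corpus

# Hypothetical stand-in data: 1021 sites of 100 pages each.
crawls = {f"site{i}": [f"site{i}/page{j}" for j in range(100)]
          for i in range(1021)}
corpus = sample_pages(crawls)
print(len(corpus))  # 1021 sites x 25 pages = 25525
```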
Nikita validates pages in either HTML (SGML) or XML mode. (You can read more about the implications of the validation mode and how Nikita chooses.) In the sample, 19140 (almost exactly ¾) were "XMLish", i.e. probably some variety of XHTML. The remainder were almost all HTML.
Nikita couldn't determine the validation mode for about 0.5% of pages. This happens when the pages are empty, unreadable, etc.
Of the XMLish pages, 8315 (43%) had no validation errors, while this was true for just 1372 (22%) of the HTML pages. Overall, 9687 of the pages (38%) were free of validation errors.
Some pages didn't even pass the most fundamental tests. Nikita found 92 (0.3%) with encoding errors that made them unreadable and 579 (2%) that had no title. (The latter includes pages with empty <title> elements as well as those where the tag was missing entirely.)
Nikita examines error-prone headers for problems, and headers containing dates that don't conform to one of the formats in RFC 2616 §3.3 are particularly common.
Of the 7648 Expires headers sent in the sample, 964 (almost 13%) were malformed. Of these, 480 (about half) were either 0 or -1. Although RFC 2616 specifically reminds clients to watch out for 0 in the Expires header, it also clearly states that zero is invalid.
Last-Modified headers fared much better, with only 95 of 7809 (1%) containing invalid dates.
Last-Modified and Expires, along with ETag and Cache-Control, are caching headers. 16799 (66%) of pages were accompanied by at least one of these headers.
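Checking whether a response carries any of these four headers is straightforward; here's a sketch (my own illustration, not Nikita's code — the sample header dict is made up):

```python
# The caching-related headers discussed above.
CACHING_HEADERS = ("Expires", "Last-Modified", "ETag", "Cache-Control")

def caching_headers_present(headers):
    """Return which caching headers appear in a response's header dict.

    Header names are compared case-insensitively, per HTTP.
    """
    lowered = {name.lower() for name in headers}
    return [h for h in CACHING_HEADERS if h.lower() in lowered]

print(caching_headers_present({
    "content-type": "text/html",
    "last-modified": "Tue, 15 Nov 1994 12:45:26 GMT",
    "cache-control": "max-age=3600",
}))  # ['Last-Modified', 'Cache-Control']
```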
There are a total of 739510 validation messages in the sample, comprising over 9100 unique messages. I decided (somewhat arbitrarily) to show only the 25 most frequent messages in the table below. These 25 represent a little more than 43% of all messages.
The most frequent message, 'reference not terminated by REFC delimiter', usually comes from an unescaped ampersand (&, ASCII 38). These are common in URLs like this one:
Note that the message 'required attribute "ALT" not specified' appears twice, once with the attribute in uppercase and once in lowercase (an HTML/XHTML difference). If I were to combine these two entries, the result would be the most common error message.
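Unescaped ampersands like the REFC case above are fixed by escaping them as &amp;amp;. Python's standard library can do this (the URL here is a made-up example):

```python
from xml.sax.saxutils import escape

# A raw ampersand in an href triggers the REFC validation message;
# escaping it as &amp; makes the reference well-formed.
raw = "http://example.com/search?q=cats&page=2"
print(escape(raw))
# http://example.com/search?q=cats&amp;page=2
```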
|Instances||Portion of Total||Message|
A page's encoding is sometimes also called its "character set" or "charset" for short.
In the pie chart and table below you can see that the vast majority of encoding declarations are either UTF-8 or ISO-8859-1. About ⅔ of the sample (67%) declare an encoding of UTF-8 and three in ten (30%) declare ISO-8859-1. The third most popular is Windows-1252 (an ISO-8859-1 superset) with a little over 1% of the total.
When no encoding is specified, Nikita usually defaults to ISO-8859-1 per HTTP rules.
In the table below, the total number of encodings found is greater than the number of pages in the sample because many pages declare multiple encodings. (Nikita will warn you if she finds multiple encodings declared for a page on your site.)
|Instances||Portion of Total||Encoding|
|* I counted the encodings "UTF-8" and "UTF8" together. There was only one occurrence of the latter.|
Web pages can declare their encoding in four different places: in the HTTP Content-Type header, inside the HTML in the META HTTP-EQUIV Content-Type tag, in the file's BOM, or in the XML declaration. (You might want to read about how Nikita divines a Web page's encoding from this jumble of information.)
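A simplified sketch of collecting declarations from all four places might look like the following. This is my own illustration, not Nikita's actual logic — real-world precedence rules (and real-world markup) are subtler than these regexes admit:

```python
import re

def declared_encodings(headers, body_bytes):
    """Collect every encoding declared for a page, tagged by source.

    `headers` is a dict of lowercased HTTP header names to values;
    `body_bytes` is the raw page. Returns (source, encoding) pairs.
    """
    found = []
    # 1. The HTTP Content-Type header, e.g. "text/html; charset=utf-8"
    match = re.search(r'charset=["\']?([\w-]+)',
                      headers.get("content-type", ""), re.I)
    if match:
        found.append(("HTTP header", match.group(1)))
    # 2. A byte order mark at the very start of the file
    if body_bytes.startswith(b"\xef\xbb\xbf"):
        found.append(("BOM", "utf-8"))
    elif body_bytes[:2] in (b"\xff\xfe", b"\xfe\xff"):
        found.append(("BOM", "utf-16"))
    # 3. An XML declaration, e.g. <?xml version="1.0" encoding="utf-8"?>
    match = re.match(rb'<\?xml[^>]*encoding=["\']([\w-]+)', body_bytes)
    if match:
        found.append(("XML declaration", match.group(1).decode("ascii")))
    # 4. A META HTTP-EQUIV tag with a charset parameter
    match = re.search(rb'<meta[^>]*charset=["\']?([\w-]+)', body_bytes, re.I)
    if match:
        found.append(("META tag", match.group(1).decode("ascii")))
    return found

# A contrived page that declares its encoding in three places at once --
# exactly the situation Nikita warns about.
page = (b'<?xml version="1.0" encoding="utf-8"?><html><head>'
        b'<meta http-equiv="Content-Type" '
        b'content="text/html; charset=iso-8859-1">'
        b'</head></html>')
print(declared_encodings({"content-type": "text/html; charset=utf-8"}, page))
```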
In the pie chart and table below you can see that almost ⅔ of the encodings are specified via a META tag and most of the rest are declared via HTTP.
In the table below, the number of encoding sources is greater than the number of pages in the sample because many pages declare their encoding in multiple places. (Nikita will warn you if she finds multiple encodings declared for a page on your site.)
|Instances||Portion of Total||Source|
|22304||60.01%||META HTTP-equiv tag|
|13359||35.94%||HTTP response header|
|1093||2.94%||Fallback to default|
|411||1.11%||The file's BOM (byte order mark)|
There are 41 distinct doctypes in the sample. For purposes of discussing them here, I considered two doctypes the same if they used the same FPI (Formal Public Identifier). For example, these two doctypes were considered equivalent:
I chose not to display doctypes that represented < 1% of the sample.
In the pie chart and table below, you can see that XHTML doctypes represent more than ¾ of the sample, with HTML making up the remainder. Interestingly, transitional doctypes (both XHTML and HTML) dominate their respective fields.
In the table below, the number of doctypes sums to less than the number of pages because I ignored doctypes that represented < 1% of the total.
|Instances||Portion of Total||Doctype|
|11093||47.83%||XHTML 1.0 Transitional|
|7390||31.86%||XHTML 1.0 Strict|
|2517||10.85%||HTML 4.01 Transitional|
|1099||4.74%||HTML 4.01 Strict|
|603||2.60%||XHTML 1.1|
|492||2.12%||HTML 4.0 Transitional|
A media type is also sometimes called a "content type". Nearly all of the pages in this sample (> 99.7%) use the text/html media type; the remaining two-tenths of one percent use application/xhtml+xml.
Nikita's view of the world might be slightly skewed away from application/xhtml+xml for a couple of reasons. First, Nikita sends an Accept header of */* and many servers capable of sending application/xhtml+xml might do so only if they find that exact string in the Accept header sent by the client.
Second, Nikita doesn't attempt to masquerade as a browser via her user agent string; it is simply "Nikita the Spider" and doesn't contain words like "Mozilla", "Opera", "WebKit", "KHTML", etc. that might convince servers that she's capable of handling XHTML. It's likely that cautious servers choose to send text/html to Nikita.
The overwhelming dominance of text/html makes the pie chart and table below superfluous but they're here for completeness.
Unlike some of the other properties discussed in this article, there's a simple 1:1 relationship between pages and media types. Each page has exactly one which is specified in the HTTP Content-Type header. (It's possible to specify something else in an HTML META HTTP-EQUIV tag, but this is non-standard and of no practical use. Nikita doesn't look for media types specified in the page contents.)
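Extracting the media type from the Content-Type header just means discarding any parameters (like charset) after the semicolon. A sketch (my own, not Nikita's code):

```python
def media_type(content_type_header):
    """Return the bare media type from an HTTP Content-Type header,
    discarding parameters such as charset. Media types are
    case-insensitive, so normalize to lowercase."""
    return content_type_header.split(";")[0].strip().lower()

print(media_type("text/HTML; charset=UTF-8"))  # text/html
print(media_type("application/xhtml+xml"))     # application/xhtml+xml
```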
|Instances||Portion of Total||Media type|
The mean page size of the sample is 20278 bytes, the median is 13795 bytes, and the mode is 42. As Douglas Adams would be quick to point out, this answer shows the value of knowing what the question is.
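All three summary statistics are one-liners with Python's statistics module; here's a sketch using a small made-up list of sizes rather than the real sample:

```python
import statistics

# Hypothetical page sizes in bytes, stand-ins for the real sample.
sizes = [42, 42, 8192, 13795, 20000, 23552, 98304]

print(statistics.mean(sizes))    # arithmetic average
print(statistics.median(sizes))  # middle value: 13795
print(statistics.mode(sizes))    # most common value: 42
```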
It's very useful to visualize page size data. The graph below shows the percentage of pages ≤ the size given on the X axis. About ¼ of the pages were 8KiB or smaller, ½ were 13KiB or smaller, ¾ were 23KiB or smaller and almost 99% were smaller than 100KiB.
Here's a tabular look at the page sizes. The X axis of the graph is on a constant scale, which facilitates interpretation but sacrifices detail for values ≤ 15KiB, which is where a lot of the action happens. This table remedies that.
|Page size (KiB)||% of sizes ≤|
References, from Nikita's perspective, are any references to other entities. Nikita is aggressive about counting these because she wants to check everything that could possibly be broken. Therefore these counts include not just common <a href="..."> links, but also references to images, frames, scripts, CSS files, etc.
The mean reference count of pages in the sample is 62.9, the median is 47, and the mode is 37.
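A rough sketch of this kind of aggressive reference counting, using Python's html.parser (the tag/attribute table below is deliberately partial, and this is an illustration rather than Nikita's implementation):

```python
from html.parser import HTMLParser

# Tags whose listed attribute points at another entity.
# A partial list for sketch purposes; a real crawler tracks more.
REF_ATTRS = {
    "a": "href", "img": "src", "script": "src",
    "link": "href", "frame": "src", "iframe": "src",
}

class ReferenceCounter(HTMLParser):
    """Count every reference to another entity, not just <a> links."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        wanted = REF_ATTRS.get(tag)
        if wanted and any(name == wanted for name, _ in attrs):
            self.count += 1

counter = ReferenceCounter()
counter.feed('<html><head><link rel="stylesheet" href="s.css">'
             '<script src="a.js"></script></head>'
             '<body><a href="/">home</a><img src="logo.png"></body></html>')
print(counter.count)  # 4: one each of link, script, a, img
```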
As with the page sizes above, it's especially helpful to visualize these data. The graph below shows that about ¼ of the pages had 29 or fewer references, ½ had 47 or fewer, ¾ had 75 or fewer and almost 94% had fewer than 150.
This graph has a long tail (not shown); to bring the curve up to 99% would require doubling the width of the graph.
Here's a tabular look at the reference counts. As mentioned above, the graph doesn't show the highest values for lack of space. This table shows what the graph can't.
Webmasters have a bit to learn about HTTP date formats given that more than one in ten Expires headers were wrong. Since RFC 2616 states that invalid dates are to be interpreted as "already expired", a misconfigured Web server can make an entire site uncacheable.
In addition to 0 and -1 being used in the Expires header, I often see this invalid value:
Sat, 1 Jan 2000 00:00:00 GMT
The "Full Date" section of RFC 2616 which describes valid date formats says that the day of the month must have two digits and shows an example to this effect. That's a detail that only a computer could care about, but computers (not humans) are precisely the target audience for HTTP headers.
The rest of these observations are essentially unchanged from the spring By The Numbers article, which means that Nikita's users haven't changed much in the past six months.
The most frequent validation messages are probably not a great surprise to anyone who has hand-coded HTML. It's unfortunately easy to forget to escape ampersands as well as the alt attribute on img tags. Validators are excellent tools for catching these sorts of mechanical mistakes.
The frequent use of META tags to specify encodings speaks to a common inability to set encodings via HTTP headers, or an unawareness of that ability.
The most surprising observation to me is the widespread use of the transitional doctypes. At the time these data were collected in late 2008, the transitional doctypes were already more than eight years old which is about half the lifetime of the Web itself. That's quite the lengthy transition...
I believe my sample is a sufficiently random representation of what Nikita sees, but what Nikita sees doesn't represent the Web as a whole. For one thing, Nikita has mostly been promoted in English-speaking venues. Also, it's reasonable to assume that those who ask Nikita to crawl their site are more aware of Web standards than the average Web site author. To quote How To Lie With Statistics, "The result of a sampling study is no better than the sample it is based on."
If you like this article, you can share it without fear of DMCA goons kicking down your door in the middle of the night. It is copyright Philip Semanchuk under an attribution, non-commercial, share-alike Creative Commons License.