Nikita the Spider

Robots.txt Survey Data for Late 2006

This is an update to my original article, written in the Spring of 2006, about the statistics Nikita had collected on the robots.txt files from some 25,000 different hosts. Over the past seven months, Nikita has visited over 150,000 hosts (151,109 to be exact), and this article presents the statistics gathered from those sites' robots.txt files. You'll need to read the original article to understand this one. One interesting point to note is that many of the numbers (such as the percentage of sites specifying immediate expiration of their robots.txt) are unchanged from the previous survey.

Expires Header

Of the sites I surveyed, 2.8% specified an Expires header along with their robots.txt file. Of those, 51.5% (1.4% of the total) specified immediate expiration of the file.
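
For anyone curious about how a crawler can respect this, here's a minimal sketch (in Python, and not Nikita's actual code) of fetching robots.txt and deciding whether its Expires header says the file is already stale:

    # Illustrative only, not Nikita's actual code.
    from datetime import datetime, timezone
    from email.utils import parsedate_to_datetime
    from urllib.request import urlopen

    def fetch_robots_txt(base_url):
        """Return (body, expires) for base_url's robots.txt."""
        response = urlopen(base_url.rstrip("/") + "/robots.txt")
        body = response.read()
        expires = None
        raw = response.headers.get("Expires")
        if raw:
            try:
                expires = parsedate_to_datetime(raw)
            except (TypeError, ValueError):
                # RFC 2616 says invalid dates (such as "Expires: 0") must be
                # treated as already in the past.
                expires = datetime.min.replace(tzinfo=timezone.utc)
        return body, expires

    def is_expired(expires):
        # No Expires header at all leaves caching to the crawler's judgement.
        if expires is None:
            return False
        if expires.tzinfo is None:
            expires = expires.replace(tzinfo=timezone.utc)
        return expires <= datetime.now(timezone.utc)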

Good Intentions, Bad Responses

A distressing 7.9% of the sites responded with content labelled as text/html. Since I couldn't examine all of these files individually, I assumed that any file that contained <title or <html in the content was a Web page. By these criteria, 91% of the responses labelled text/html really were HTML – presumably some sort of “Ooops, that file is missing” Web page. (Some spot checks supported this assumption.) The Webmasters of these sites need a gentle but firm whack with a clue stick. A request for a resource that doesn't exist should be answered with response code 404, especially for a file like robots.txt where the response code is a meaningful part of the specification. (A 404 received in response to a request for robots.txt means that all robots are welcome.)
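
For the record, the heuristic amounts to just a few lines. This is a sketch of the rule described above, not the exact script I ran over the sample:

    # A sketch of the heuristic described above: a response served as
    # text/html is assumed to be a Web page (rather than a real
    # robots.txt) if its body contains "<title" or "<html".
    def looks_like_error_page(content_type, body):
        if not content_type.lower().startswith("text/html"):
            return False
        lowered = body.lower()
        return "<title" in lowered or "<html" in lowered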

404s

Over half of the sites – 52.2% – didn't provide a robots.txt and responded to Nikita's query with a 404. This is perfectly valid and simply means that robots are permitted unrestricted access to the site. If we assume that the Web pages returned above should really have been 404s, then the share of sites without a robots.txt file rises to roughly 60%.

401s and 403s

Just 0.4% of the sites chose to use the part of MK1994 that says that 401 and 403 responses indicate that the site is off limits to all robots. My guess is that some of these sites simply respond with a 401 or 403 for all files when the user agent looks like a robot's. In other words, this feature of the robots.txt spec is barely used.
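
Put together, my reading of MK1994 maps the robots.txt response code to a crawl policy roughly like this. It's a sketch, not a definitive implementation, and it ignores details such as redirects:

    # A rough sketch of interpreting the response code for /robots.txt.
    def access_policy(status_code):
        if status_code in (401, 403):
            return "disallow-all"       # the site is off limits to all robots
        if status_code == 404:
            return "allow-all"          # no robots.txt: unrestricted access
        if 200 <= status_code < 300:
            return "parse-robots-txt"   # obey whatever the file says
        return "undefined"              # codes that make no sense here (next section)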

Incomprehensible Responses

Only 0.3% of the sites Nikita visited returned response codes that don't make sense in context. Some examples include 202 (Accepted), 204 (No Content), 205 (Reset Content), 400 (Bad Request), 406 (Not Acceptable), 410 (Gone), and 456 (Go Away Before I Taunt You A Second Time). While these sites would be interesting for an article entitled “101 Ways to Misconfigure Your Web Server”, they're a very small portion of this sample and don't merit further attention here.

Non-ASCII Characters

Nearly all of the robots.txt files in the sample are pure ASCII; only 0.27% (401 of them) are not. Of these 401 non-ASCII robots.txt files, 126 (about 31%) restrict their non-ASCII to harmless comment fields. Another 28% owe their non-ASCII to good old Hämähäkki, the gone-but-not-forgotten Finnish spider from the mid-1990s described in my original article. Another 20% are non-ASCII because they contain a BOM, and last but not least, about 8% (34, or about 1 in every 5,000 of the total sample) contained non-ASCII in significant fields. By "significant" I mean fields that a robots.txt interpreter would be likely to read.

The remainder of the robots.txt files that contained non-ASCII were mostly useless as robots.txt files. Some sent other content (like HTML or GIFs) mislabelled as text/plain, while a few sent robots.txt files with other content appended. This last category included one file with PHP code attached that contained a database userid and password. (The robots.txt for that site has since been corrected.)
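
To make the classification above concrete, here is a rough sketch of the checks involved: is the file pure ASCII, does it start with a UTF-8 BOM, and is any non-ASCII confined to comments? It's illustrative only, not Nikita's actual code:

    # Classify a raw robots.txt (as bytes).
    import codecs

    def is_ascii(raw_bytes):
        return all(byte < 128 for byte in raw_bytes)

    def has_utf8_bom(raw_bytes):
        return raw_bytes.startswith(codecs.BOM_UTF8)

    def non_ascii_only_in_comments(raw_bytes):
        for line in raw_bytes.splitlines():
            if line.lstrip().startswith(b"#"):
                continue                            # the whole line is a comment
            significant = line.split(b"#", 1)[0]    # drop any trailing comment
            if not all(byte < 128 for byte in significant):
                return False                        # non-ASCII in a significant field
        return True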

Crawl-delay

In this sample, 1.8% of robots.txt files contained a Crawl-delay specification. The minimum delay was 1 second and the maximum was 4 million seconds (about 46 days). That last value came from a site that specified dozens of crawl delays with values in the millions. This site and one other frustrated Webmaster skewed the crawl-delay data quite a bit, so I discarded the outliers (all values ≥ 1 million seconds) from these two sites. That done, the maximum delay became 172,800 seconds (48 hours), the mean 535.22 seconds, and the median and mode both 10 seconds.
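
The numbers above boil down to a calculation along these lines. The sketch below is illustrative (the helper names are mine), not Nikita's actual code:

    # Pull Crawl-delay values out of a robots.txt body and summarize them.
    import statistics

    def crawl_delays(robots_txt):
        """Return the Crawl-delay values (as floats) found in a robots.txt body."""
        delays = []
        for line in robots_txt.splitlines():
            field, _, value = line.partition(":")
            if field.strip().lower() == "crawl-delay":
                try:
                    delays.append(float(value.split("#", 1)[0].strip()))
                except ValueError:
                    pass  # unparseable value; skip it
        return delays

    def summarize(delays, outlier_threshold=1_000_000):
        # Drop the outliers (values of a million seconds or more, all from
        # two sites) before computing the statistics quoted above.
        kept = [d for d in delays if d < outlier_threshold]
        return {
            "max": max(kept),
            "mean": statistics.mean(kept),
            "median": statistics.median(kept),
            "mode": statistics.mode(kept),
        }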

To get an idea of what crawl delays are common, it's helpful to look at the percentage of crawl delays less than or equal to a given value. The table below shows exactly that. For instance, 66% of all of the crawl delays were ≤ 20 seconds. (The percentages are rounded to the nearest integer.) The data in this table are the only numbers that deviate significantly from those of Spring 2006.

Delay (seconds)    % of delays ≤
      1                10%
      2                13%
      3                16%
      5                21%
      9                22%
     10                55%
     15                60%
     20                66%
     30                72%
     60                82%
    120                88%
   3000                92%
   4000                99.9%
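
If you'd like to reproduce the table from your own data, the cumulative percentages amount to a one-liner. This sketch simply uses the thresholds from the table above:

    # Percentage of delays less than or equal to each threshold.
    def cumulative_percentages(delays, thresholds=(1, 2, 3, 5, 9, 10, 15,
                                                   20, 30, 60, 120, 3000, 4000)):
        total = len(delays)
        return {t: 100.0 * sum(1 for d in delays if d <= t) / total
                for t in thresholds}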

Summary of Figures

The table below summarizes the frequency of the items discussed above. The figures given are from my sample of robots.txt files from 151,109 different hosts. “Different”, in this case, was determined by a case-insensitive string comparison of the non-path portion of the URL. For example, news.example.com was considered different from www.example.com.

Feature                    Occurrences    Percentage
Expires header present           4205         2.8%
Return text/html                11856         7.9%
Return 404                      78843        52.2%
Return 401 or 403                 622         0.4%
Contain non-ASCII                 401         0.27%
Contain a BOM                      82         0.05%
Specify Crawl-delay              2766         1.8%
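
As an aside, the "different hosts" comparison mentioned just before the table amounts to something like this sketch (using Python's urllib.parse; not the exact code Nikita uses):

    # Two URLs count as the same host only if their non-path portions
    # match case-insensitively.
    from urllib.parse import urlsplit

    def same_host(url_a, url_b):
        a, b = urlsplit(url_a), urlsplit(url_b)
        return ((a.scheme.lower(), a.netloc.lower())
                == (b.scheme.lower(), b.netloc.lower()))

    # news.example.com and www.example.com are different hosts...
    assert not same_host("http://news.example.com/", "http://www.example.com/")
    # ...but case differences alone don't matter.
    assert same_host("http://Example.Com/a", "http://example.com/b")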

Conclusions

I'm most surprised that almost every single statistic is nearly unchanged from the spring survey. Only the crawl delays have changed, and since many crawl delays can be specified in one robots.txt file, those statistics are more susceptible to influence by just a few Webmasters.

Thanks for reading! Comments are welcome.