Nikita the Spider

Robots.txt Survey Data for 2007

This is the third and most recent article in a series that began with an article on robots.txt files in the spring of 2006. That article gives the background on the statistics in this survey of 177,930 unique hosts from which Nikita requested a robots.txt between 01 Jan 2007 and 31 Dec 2007. The statistics here are quite similar to those from 2006.

Expires Header

Of the sites in the survey, 2.9% specified an Expires header along with their robots.txt file. Of those, 51.6% (1.5% of the total) specified immediate expiration of the file.
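
Here's a minimal Python sketch of how a crawler might honor the Expires header when deciding how long to cache a fetched robots.txt. The example URL and the one-hour default lifetime are assumptions for illustration only, not anything the survey measured or anything Nikita necessarily does.

    from datetime import datetime, timedelta, timezone
    from email.utils import parsedate_to_datetime
    from urllib.request import urlopen

    def fetch_robots_txt(url="http://www.example.com/robots.txt"):
        """Fetch robots.txt and work out when the cached copy should go stale."""
        with urlopen(url) as response:
            body = response.read()
            expires_header = response.headers.get("Expires")

        now = datetime.now(timezone.utc)
        expires = now + timedelta(hours=1)      # assumed default cache lifetime
        if expires_header:
            try:
                parsed = parsedate_to_datetime(expires_header)
                if parsed.tzinfo is None:       # no zone in the header; assume UTC
                    parsed = parsed.replace(tzinfo=timezone.utc)
                expires = parsed
            except (TypeError, ValueError):
                pass                            # unparseable date; keep the default

        # A date at or before "now" is the "immediate expiration" mentioned above.
        return body, expires, expires <= now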

Good Intentions, Bad Responses

A distressing 8% of the sites responded with content labelled as text/html. Since I couldn't examine all these files individually, I assumed that any file that contained <title or <html in the content was a Web page. By these criteria, 87% of the responses labelled text/html really were HTML – presumably some sort of “Ooops, that file is missing” Web page. The Webmasters of these sites need a gentle but firm whack with a clue stick. Requests for a resource that's not found should return response code 404, especially for a file like robots.txt where the response code is a meaningful part of the specification. (A 404 received in response to a request for robots.txt means that all robots are welcome.)
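
For illustration, here is a small Python version of the rough test described above. The function name and the decoding choice are my own; the survey simply looked for the two substrings.

    def looks_like_html(body_bytes):
        """Rough test from the text above: a body containing '<title' or
        '<html' is assumed to be a Web page rather than a robots.txt file."""
        text = body_bytes.decode("utf-8", errors="replace").lower()
        return "<title" in text or "<html" in text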

404s

Just under half of the sites -- 46.6% -- didn't provide a robots.txt and responded to Nikita's query with a 404. This is perfectly valid and simply means that robots are permitted unrestricted access to the site. This is a small but significant change from the Winter 2006 survey, which found 52% of sites returning a 404.

401s and 403s

Just .4% of the sites chose to use the part of MK1994 that says that 401 and 403 responses indicate that the site is off limits to all robots. My guess is that some of these sites simply respond with a 401 or 403 for all files when the user agent looks like a robot's. In other words, this feature of the robots.txt spec is barely used.
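
Putting the last few sections together, here is one way a crawler might map these response codes to a crawl policy. This is a sketch of the conventions discussed above, with labels of my own choosing, not code from Nikita.

    def robots_policy(status_code):
        """Map the status of a robots.txt request to a crawl policy,
        per the conventions discussed above."""
        if status_code in (401, 403):
            return "disallow-all"      # the whole site is off limits to robots
        if status_code == 404:
            return "allow-all"         # no robots.txt: unrestricted access
        if status_code == 200:
            return "parse"             # success: obey whatever the file says
        return "undefined"             # odd codes like the ones discussed below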

Incomprehensible Responses

Just .1% of the sites Nikita visited returned response codes that don't make sense in context. Some examples include 201 (Created), 202 (Accepted), 204 (No Content), 400 (Bad Request), 406 (Not Acceptable), 410 (Gone), and 666 (Abandon Hope All Ye Who Enter Here). While these sites would be interesting for an article entitled “101 Ways to Misconfigure Your Web Server”, they're a very small portion of this sample and don't merit further attention here.

Non-ASCII Characters

Nearly all of the robots.txt files in the sample are pure ASCII. 1033 robots.txt files (.58% of the sample) contained non-ASCII, which is more than double the number found in the Winter 2006 survey. However, 472 of the 1033 were very similar files (with non-ASCII in a comment header) from the same root domain (alibaba.com). Ignoring these leaves 563 files (about .3% of the sample) with non-ASCII, which is very close to the Winter 2006 figure.

Of these 563 non-ASCII robots.txt files, only 84 contained non-ASCII in significant fields, several of them due to our old friend Hämähäkki. Others (108, or .06% of the total sample) were non-ASCII due to the presence of a BOM, a few contained garbage instead of robots.txt data, and most merely had non-ASCII in the comments.
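
A quick Python sketch of this sort of classification follows. It only looks for a UTF-8 BOM, which is an assumption on my part; the survey doesn't say which encodings turned up.

    import codecs

    def classify_bytes(raw):
        """Report whether a robots.txt body starts with a UTF-8 BOM and
        whether it contains any non-ASCII bytes at all."""
        has_bom = raw.startswith(codecs.BOM_UTF8)
        body = raw[len(codecs.BOM_UTF8):] if has_bom else raw
        has_non_ascii = any(b > 0x7F for b in body)
        return has_bom, has_non_ascii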

Crawl-delay

In this sample, 1.8% of robots.txt files contained a Crawl-Delay specification. The minimum delay was 0 seconds and the maximum was 3.33E+072 seconds (which means "don't return until the sun goes dark"). If I discard outlier values (which I defined as all values > 86400, or one day), then the mean of the remaining data is 723 seconds, and the median and mode are both 10.
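
In code, the outlier-trimmed summary looks something like this; the sample values in the example are made up, not the survey's data.

    from statistics import mean, median, mode

    def crawl_delay_summary(delays, cutoff=86400):
        """Discard values greater than one day, then report the mean, median,
        and mode of what remains, as described above."""
        kept = [d for d in delays if d <= cutoff]
        return mean(kept), median(kept), mode(kept)

    # Made-up example, not the survey data:
    print(crawl_delay_summary([0, 10, 10, 20, 3600, 3.33e72]))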

To get an idea of which crawl delays are common, it's helpful to look at the percentage of crawl delays less than or equal to a given value. The table below shows exactly that. For instance, 66% of all of the crawl delays were ≤ 20 seconds. (The percentages are rounded to the nearest integer; a short sketch of the computation follows the table.)

Delay (seconds)    % of delays ≤ this value
1                  4%
2                  7%
3                  8%
5                  14%
9                  14%
10                 59%
15                 61%
20                 66%
30                 73%
60                 79%
120                87%
3000               90%
4000               99%
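
Cumulative percentages like these can be computed with a few lines of Python. The delays and thresholds below are made up for illustration, not taken from the survey.

    def cumulative_percentages(delays, thresholds):
        """For each threshold, the percentage of delays that are <= it,
        rounded to the nearest integer as in the table above."""
        total = len(delays)
        return {t: round(100 * sum(1 for d in delays if d <= t) / total)
                for t in thresholds}

    # Made-up example, not the survey data:
    print(cumulative_percentages([1, 5, 10, 10, 20, 60, 120], [10, 20, 60]))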

Summary of Figures

The table below summarizes the frequency of the items discussed above. The figures given are from my sample of robots.txt files from 177,930 different hosts. “Different”, in this case, was determined by a case-insensitive string comparison of the non-path portion of the URL (a short sketch of that comparison follows the table). For example, news.example.com was considered different from www.example.com.

Feature                   Occurrences    Percentage
Expires header present    5241           2.9%
Return text/html          14249          8.0%
Return 404                82975          46.6%
Return 401 or 403         771            0.4%
Contain non-ASCII         1033           0.58%
Contain a BOM             108            0.06%
Specify Crawl-Delay       3225           1.8%
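
As for the host comparison mentioned above, here is one way it might look in Python. The function name and details are my own illustration, not Nikita's actual code.

    from urllib.parse import urlsplit

    def host_key(url):
        """Case-insensitive key built from the non-path portion of a URL."""
        parts = urlsplit(url)
        return (parts.scheme.lower(), parts.netloc.lower())

    # news.example.com and www.example.com count as different hosts:
    print(host_key("http://news.example.com/robots.txt") ==
          host_key("http://www.example.com/robots.txt"))    # False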

Thanks for reading! Comments are welcome.