This is the third and most recent article in a series that began with an article on robots.txt files in the spring of 2006. That article provides the background on the statistics in this survey of 177,930 unique hosts from which Nikita requested a robots.txt file between 01 Jan 2007 and 31 Dec 2007. The statistics here are quite similar to those from 2006.
Of the sites in the survey, 2.9% specified an Expires header along with their robots.txt file. Of those, 51.6% (1.5% of the total) specified immediate expiration of the file.
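Deciding whether a site "specified immediate expiration" comes down to comparing the `Expires` header against the response's `Date` header. Here's a minimal sketch of that check; the survey's actual code isn't published, so the function name and the treatment of unparseable values (e.g. `Expires: 0`, which RFC 2616 says should be treated as already stale) are my assumptions.

```python
from email.utils import parsedate_to_datetime

def expires_immediately(expires_header, date_header):
    """Return True if an Expires header demands immediate expiration.

    An Expires date at or before the response's Date header, or an
    invalid value such as "0", means the resource is already stale.
    Returns None when no Expires header was sent at all.
    """
    if expires_header is None:
        return None
    try:
        return (parsedate_to_datetime(expires_header)
                <= parsedate_to_datetime(date_header))
    except (TypeError, ValueError):
        # Unparseable Expires values count as "expire immediately".
        return True
```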
A distressing 8% of the sites responded with content labelled as `text/html`. Since I couldn't examine all of these files individually, I assumed that any file that contained `<title` or `<html` in the content was a Web page. By these criteria, 87% of the responses labelled `text/html` really were HTML, presumably some sort of "Ooops, that file is missing" Web page. The Webmasters of these sites need a gentle but firm whack with a clue stick. Requests for a resource that's not found should return response code 404, especially for a file like robots.txt where the response code is a meaningful part of the specification. (A 404 received in response to a request for robots.txt means that all robots are welcome.)
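The heuristic described above is simple enough to sketch in a few lines. This mirrors the rule as stated in the text (a `text/html` body containing `<title` or `<html` is assumed to be a Web page); the function name and the case-insensitive matching are my own choices.

```python
def looks_like_html(content, content_type):
    """Flag a robots.txt response that is probably an error page.

    A body served as text/html that contains "<title" or "<html"
    is assumed to be a Web page rather than a real robots.txt file.
    """
    if "text/html" not in content_type.lower():
        return False
    lowered = content.lower()
    return "<title" in lowered or "<html" in lowered
```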
Just under half of the sites (46.6%) didn't provide a robots.txt and responded to Nikita's query with a 404. This is perfectly valid and simply means that robots are permitted unrestricted access to the site. This is a small but significant change from the Winter 2006 survey, which found 52% of sites returning a 404.
Just 0.4% of the sites chose to use the part of MK1994 that says that 401 and 403 responses indicate that the site is off limits to all robots. My guess is that some of these sites simply respond with a 401 or 403 for all files when the user agent looks like a robot's. In other words, this feature of the robots.txt spec is barely used.
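Taken together, the status-code rules from MK1994 discussed so far form a small decision table, which can be expressed directly. This is a sketch of the mapping as the article describes it; the return-value strings are illustrative labels, not part of any spec.

```python
def robots_policy(status_code):
    """Map a robots.txt response code to a crawl policy per MK1994:
    401/403 mean the whole site is off limits, 404 means unrestricted
    access, and a 2xx response means the body should be parsed as a
    robots.txt file. Anything else is undefined by the spec.
    """
    if status_code in (401, 403):
        return "disallow all"
    if status_code == 404:
        return "allow all"
    if 200 <= status_code < 300:
        return "parse body"
    return "undefined"
```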
Just 0.1% of the sites Nikita visited returned response codes that don't make sense in context. Some examples include 201 (Created), 202 (Accepted), 204 (No Content), 400 (Bad Request), 406 (Not Acceptable), 410 (Gone), and 666 (Abandon Hope All Ye Who Enter Here). While these sites would be interesting for an article entitled "101 Ways to Misconfigure Your Web Server", they're a very small portion of this sample and don't merit further attention here.
Nearly all of the robots.txt files in the sample are pure ASCII. 1033 robots.txt files (0.58% of the sample) contained non-ASCII, which is more than double the number found in the Winter 2006 survey. However, 472 of the 1033 were very similar files (with non-ASCII in a comment header) from the same root domain (alibaba.com). Ignoring these leaves 563 (about 0.3%) of the sample with non-ASCII, which is very close to the Winter 2006 figure.
Of these 563 non-ASCII robots.txt files, only 84 contained non-ASCII in significant fields, several of them due to our old friend Hämähäkki. Others (108, or 0.06%) were non-ASCII due to the presence of a BOM, a few contained garbage instead of robots.txt data, and most merely had non-ASCII in the comments.
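A rough classification like the one above can be done on the raw bytes of each file. The sketch below checks only for a UTF-8 BOM and for bytes outside the ASCII range; the article doesn't say exactly how its tally was made (e.g. whether UTF-16 BOMs were also counted), so this is an assumption on my part.

```python
def classify_bytes(raw):
    """Roughly classify a robots.txt body: leading UTF-8 BOM,
    non-ASCII bytes anywhere, or pure ASCII."""
    if raw.startswith(b"\xef\xbb\xbf"):
        return "bom"
    if any(b > 0x7F for b in raw):
        return "non-ascii"
    return "ascii"
```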
In this sample, 1.8% of robots.txt files contained a Crawl-Delay specification. The minimum delay was 0 seconds and the maximum was 3.33E+072 seconds (which means "don't return until the sun goes dark"). If I discard outlier values (which I defined as all values > 86400, i.e. one day), then the mean of the remaining data is 723, and the median and mode are both 10.
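The outlier-trimmed summary described above is a one-liner with the standard library. The cutoff of 86400 seconds comes from the article; the sample data below is made up purely to illustrate the function, it is not the survey data.

```python
from statistics import mean, median, mode

def crawl_delay_stats(delays, cutoff=86400):
    """Summarize Crawl-Delay values after discarding outliers above
    the cutoff (one day by default)."""
    kept = [d for d in delays if d <= cutoff]
    return mean(kept), median(kept), mode(kept)

# Illustrative values only: the 1e9 entry is discarded as an outlier.
m, med, mo = crawl_delay_stats([10, 10, 10, 5, 20, 1e9])
```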
To get an idea of what crawl delays are common, it's helpful to look at the percentage of crawl delays less than or equal to a given value. The table below shows exactly that. For instance, 66% of all of the crawl delays were ≤ 20 seconds. (The percentages are rounded to the nearest integer.)
| Delay (seconds) | % of delays ≤ |
|---|---|
| 1 | 4% |
| 2 | 7% |
| 3 | 8% |
| 5 | 14% |
| 9 | 14% |
| 10 | 59% |
| 15 | 61% |
| 20 | 66% |
| 30 | 73% |
| 60 | 79% |
| 120 | 87% |
| 3000 | 90% |
| 4000 | 99% |
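A cumulative table like the one above is straightforward to compute: for each threshold, count the fraction of delays at or below it and round to the nearest integer percentage. This sketch uses made-up data to demonstrate; the function name is mine.

```python
def cumulative_percentages(delays, thresholds):
    """For each threshold, the percentage of delays less than or
    equal to it, rounded to the nearest integer."""
    n = len(delays)
    return {t: round(100 * sum(1 for d in delays if d <= t) / n)
            for t in thresholds}

# Illustrative data only, not the survey's delays.
table = cumulative_percentages([1, 2, 10, 10, 30], [1, 10, 30])
```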
The table below summarizes the frequency of the items discussed above. The figures given are from my sample of robots.txt files from 177,930 different hosts. "Different", in this case, was determined by a case-insensitive string comparison of the non-path portion of the URL. For example, `news.example.com` was considered different from `www.example.com`.
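That uniqueness criterion (case-insensitive comparison of the non-path portion of the URL) can be sketched with the standard library's URL parsing. The function name is mine; the example hosts match those used in the text.

```python
from urllib.parse import urlsplit

def host_key(url):
    """Case-insensitive key for the non-path portion of a URL,
    used to decide whether two hosts count as different."""
    parts = urlsplit(url)
    return (parts.scheme.lower(), parts.netloc.lower())
```

Two URLs are the "same host" exactly when their keys are equal, so differing subdomains produce different keys.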
| Feature | Occurrences | Percentage |
|---|---|---|
| Expires header present | 5241 | 2.9% |
| Return text/html | 14249 | 8.0% |
| Return 404 | 82975 | 46.6% |
| Return 401 or 403 | 771 | 0.4% |
| Contain non-ASCII | 1033 | 0.58% |
| Contain a BOM | 108 | 0.06% |
| Specify Crawl-Delay | 3225 | 1.8% |
Thanks for reading! Comments are welcome.
If you like this article, you can share it without fear of DMCA goons kicking down your door in the middle of the night. It is copyright Philip Semanchuk under a non-commercial, share-alike Creative Commons License.