This is an update to my original article, written in the Spring of 2006, about the statistics Nikita had collected on the robots.txt files from some 25,000 different hosts. Over the past seven months, Nikita has visited over 150,000 hosts (151,109 to be exact), and this article presents the statistics gathered from those sites' robots.txt files. You'll need to read the original article to understand this one. One interesting point to note is that many of the numbers (such as the percentage of sites specifying immediate expiration of their robots.txt) are unchanged from the previous survey.
Of the sites I surveyed, 2.8% specified an Expires header along with their robots.txt file. Of those, 51.5% (1.4% of the total) specified immediate expiration of the file.
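For anyone who wants to reproduce this tally, "immediate expiration" can be detected by comparing the Expires header against the response's Date header. Here's a minimal sketch in Python; the function name and the dict-of-headers interface are my own illustration, not anything from the survey:

```python
from email.utils import parsedate_to_datetime

def expires_immediately(headers):
    """Return True if the response's Expires header demands immediate
    expiration: a date at or before the server's Date header, or an
    invalid value such as "0" or "-1" (which HTTP says means "already
    expired")."""
    expires = headers.get("Expires")
    if expires is None:
        return False
    try:
        expires_at = parsedate_to_datetime(expires)
    except (TypeError, ValueError):
        # Unparseable dates (e.g. "0") are treated as already expired.
        return True
    try:
        served_at = parsedate_to_datetime(headers.get("Date"))
    except (TypeError, ValueError):
        return False  # Can't compare without a reference time.
    return expires_at <= served_at
```

This ignores Cache-Control, which can override Expires, but it matches the kind of check the percentages above describe.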
A distressing 7.9% of the sites responded with content labelled text/html. Since I couldn't examine all of these files individually, I assumed that any file containing <html in its content was a Web page. By this criterion, 91% of the responses labelled text/html were HTML – presumably some sort of “Oops, that file is missing” Web page. (Spot checks supported this assumption.) The Webmasters of these sites need a gentle but firm whack with a clue stick. Requests for a resource that's not found should return response code 404, especially for a file like robots.txt where the response code has a meaning defined by the specification. (A 404 received in response to a request for robots.txt means that all robots are welcome.)
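The heuristic described above is simple enough to show in full. This is my own illustration of it, not the survey's actual code:

```python
def looks_like_html(body: bytes) -> bool:
    """The survey's rule of thumb: any body containing "<html"
    (case-insensitively) is assumed to be a Web page -- most likely
    an error page served in place of a proper 404."""
    return b"<html" in body.lower()
```

A real robots.txt file has no reason to contain that substring, so false positives should be rare, though a robots.txt with an HTML fragment pasted into a comment would be misclassified.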
Over half of the sites – 52.2% – didn't provide a robots.txt and responded to Nikita's query with a 404. This is perfectly valid and simply means that robots are permitted unrestricted access to the site. If we assume that the Web pages returned above should really return 404s, then the number of sites without a robots.txt file jumps to over 60%.
Just 0.4% of the sites chose to use the part of MK1994 that says that 401 and 403 responses indicate that the site is off limits to all robots. My guess is that some of these sites simply respond with a 401 or 403 for all files when the user agent looks like a robot's. In other words, this feature of the robots.txt spec is barely used.
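The response-code rules discussed in the last two paragraphs can be summarized as a tiny decision function; this is my own sketch, though Python's standard urllib.robotparser module applies essentially the same mapping when it fetches a robots.txt file:

```python
def access_policy(status_code: int) -> str:
    """Map the status code of a robots.txt fetch to the default
    policy described in MK1994."""
    if status_code in (401, 403):
        return "all robots excluded"   # the site is off limits
    if status_code == 404:
        return "all robots welcome"    # no restrictions at all
    return "parse the file"            # e.g. 200: honor its rules
```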
0.3% of the sites Nikita visited returned response codes that don't make sense in context. Some examples include 202 (Accepted), 204 (No Content), 205 (Reset Content), 400 (Bad Request), 406 (Not Acceptable), 410 (Gone), and 456 (Go Away Before I Taunt You A Second Time). While these sites would be interesting for an article entitled “101 Ways to Misconfigure Your Web Server”, they're a very small portion of this sample and don't merit further attention here.
Nearly all of the robots.txt files in the sample are pure ASCII; only 0.27% (401 of them) are not. Of these 401 non-ASCII robots.txt files, 126 (about 31%) restrict their non-ASCII to harmless comment fields. Another 28% attribute their non-ASCII to good old Hämähäkki, the gone-but-not-forgotten Finnish spider from the mid-1990s described in my original article. Another 20% are non-ASCII because they contain a BOM, and last but not least, about 8% (34, or about 1 in every 5,000 of the total sample) contained non-ASCII in significant fields. By “significant” I mean fields that a robots.txt interpreter would be likely to read.
The remainder of the robots.txt files that contained non-ASCII were mostly useless as robots.txt files. Some sent other content (like HTML or GIFs) mislabelled as text/plain, while a few sent robots.txt files that had other content appended. This last category included one file with PHP code attached that included a database userid and password. (The robots.txt for that site has since been corrected.)
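A rough version of the classification used in the two paragraphs above (BOM present, non-ASCII confined to comments, or non-ASCII in significant fields) might look like this. It's a sketch under my own assumptions, not the survey's code:

```python
import codecs

def classify_encoding(raw: bytes) -> str:
    """Classify a robots.txt body the way the survey's tallies do."""
    # A byte-order mark gets its own category.
    for bom in (codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
        if raw.startswith(bom):
            return "has BOM"
    try:
        raw.decode("ascii")
        return "pure ASCII"
    except UnicodeDecodeError:
        pass
    # Non-ASCII confined to comment lines (starting with '#') is
    # harmless to a robots.txt interpreter.
    significant = b"\n".join(
        line for line in raw.splitlines()
        if not line.lstrip().startswith(b"#")
    )
    try:
        significant.decode("ascii")
        return "non-ASCII in comments only"
    except UnicodeDecodeError:
        return "non-ASCII in significant fields"
```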
In this sample, 1.8% of robots.txt files contained a Crawl-Delay specification. The minimum delay was 1 second and the maximum was 4 million seconds (which is 46 days). This last value was from a site that specified dozens of crawl delays with values in the millions. This site and one other frustrated Webmaster skewed the crawl-delay data quite a bit, so I discarded the outliers (all values >= 1 million) from these two sites. That done, the maximum delay became 172800 seconds (48 hours), the mean 535.22, the median and mode both 10.
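The outlier trimming and summary statistics described above are easy to reproduce. Here's a sketch; the threshold, function name, and return format are my own choices:

```python
from statistics import mean, median, mode

def summarize_delays(delays, outlier_threshold=1_000_000):
    """Summarize Crawl-Delay values as in the article: discard
    outliers at or above the threshold, then report the extremes
    and the usual central measures for what remains."""
    kept = [d for d in delays if d < outlier_threshold]
    return {
        "min": min(kept),
        "max": max(kept),
        "mean": mean(kept),
        "median": median(kept),
        "mode": mode(kept),
    }
```

Feeding it a made-up sample like `[1, 10, 10, 20, 172800, 4_000_000]` drops the four-million-second entry and reports a mode and median of 10, which is how the real data behaved once the two outlier sites were excluded.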
To get an idea of what crawl delays are common, it's helpful to look at the percentage of crawl delays less than or equal to a given value. The table below shows exactly that. For instance, 66% of all of the crawl delays were ≤ 20 seconds. (The percentages are rounded to the nearest integer.) The data in this table are the only numbers that deviate significantly from those of Spring 2006.
| Delay (seconds) | % of delays ≤ |
|---|---|
| 20 | 66 |
The table below summarizes the frequency of the items discussed above. The figures given are from my sample of robots.txt files from 151,109 different hosts. “Different”, in this case, was determined by a case-insensitive string comparison of the non-path portion of the URL. For example, news.example.com was considered different from www.example.com.
| Item | Sites | % of total |
|---|---|---|
| Expires header present | 4205 | 2.8% |
| Return 401 or 403 | 622 | 0.4% |
| Contain a BOM | 82 | 0.05% |
I'm most surprised that almost every single statistic is nearly unchanged from the spring survey. Only the crawl delays have changed, and since many crawl delays can be specified in one robots.txt file, those statistics are more susceptible to influence by just a few Webmasters.
Thanks for reading! Comments are welcome.
If you like this article, you can share it without fear of DMCA goons kicking down your door in the middle of the night. It is copyright Philip Semanchuk under a non-commercial, share-alike Creative Commons License.