Nikita is a spider with good manners; she checks the robots.txt file of every site she accesses. By itself, a single robots.txt file is not very interesting. But a survey of a large number of them can inform us about how this file is used. This survey examines how many sites use a robots.txt file, whether or not they're used correctly, and how often one is likely to encounter some “quirky” features of robots.txt.
Can one go so far as to call the assembled results interesting? I'll leave that for you to decide. My survey of robots.txt files from about 25,000 different hosts (collected in the Spring of 2006) is below.
In December of 2006, I added statistics about a survey of over 150,000 robots.txt files and in January of 2008, I added a survey covering all of 2007.
Robots.txt is kind of an odd duck. It is not an official standard blessed by the Internet Powers That Be, but it is very widely used and there has been good agreement on the format (if only shaky understanding of it) since it was adopted in 1994. Two documents assume the role of standards; Martijn Koster is the author of both. The first, which I refer to as MK1994, has become the de facto standard. The second (MK1996) is a clearer and richer document IMHO, but it doesn't carry as much weight with the Internet community as MK1994.
Of the sites I surveyed, 2.7% specified an Expires header along with their robots.txt file. Of those, 52.6% (1.4% of the total) specified immediate expiration of the file. The remainder specified expiration times ranging from 1 second to 1024 days.
It surprises me that so few sites take advantage of the Expires header. In the discussion leading up to the first robots.txt standard (which is no longer online, unfortunately), someone suggested that there should be some way to indicate the expiration date from within the file itself. Martijn Koster reminded him that HTTP's Expires header already served exactly that purpose, and someone else said, “I agree with Martijn that we should use the (painfully obvious) existing mechanism”. In other words, it seems likely that use of the Expires header wasn't mentioned in MK1994 because it was deemed self-evident. MK1996 states explicitly that robots should respect the Expires header and defines a default expiration of seven days if it is absent.
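A minimal sketch of how a robot might apply the MK1996 rule described above: honour the Expires header when present and fall back to seven days when it is absent. The function name `robots_txt_expiry` is hypothetical, and treating an unparseable header value as "already expired" is an assumption borrowed from general HTTP caching practice, not something either robots.txt document spells out.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# MK1996 default lifetime when no Expires header is sent.
DEFAULT_LIFETIME = timedelta(days=7)


def robots_txt_expiry(expires_header, fetched_at):
    """Return the time after which a cached robots.txt should be refetched.

    expires_header: raw Expires header value, or None if it was absent.
    fetched_at: timezone-aware datetime of the fetch.
    """
    if expires_header is None:
        return fetched_at + DEFAULT_LIFETIME
    try:
        expires = parsedate_to_datetime(expires_header)
    except (TypeError, ValueError):
        # Assumption: values such as "0" or "-1" mean "expire immediately".
        return fetched_at
    if expires.tzinfo is None:
        # Dates without a usable zone are assumed to be UTC.
        expires = expires.replace(tzinfo=timezone.utc)
    # A date in the past is the same as immediate expiration.
    return max(expires, fetched_at)
```

With a fetch at noon UTC and an Expires value one hour later, the function returns 13:00 UTC; with no header at all it returns the fetch time plus seven days.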
A distressing 7.7% of the sites responded with content labelled as text/html. Since I couldn't examine all these files individually, I assumed that any file that contained <title or <html in the content was a Web page. By these criteria, at least 91% of the responses labelled text/html really were HTML – presumably some sort of “Ooops, that file is missing” Web page. (Some spot checks added strength to this assertion.) The Webmasters of these sites need a gentle but firm whack with a clue stick. Requests for a resource that's not found should return response code 404, especially for a file like robots.txt where the response code is a meaningful part of the specification. (A 404 received in response to a request for robots.txt means that all robots are welcome.) For the record, none of the sites that returned text/html media gave a 404 response code.
Of the remainder labelled as text/html, most were ordinary robots.txt files mislabelled as HTML content. (Don't put away that clue stick yet!) The others were varying bits of mess: empty files, non-ASCII garbage, etc.
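The heuristic described above (any response containing <title or <html is counted as a Web page) can be sketched in a few lines. The function name `looks_like_html` is hypothetical, and matching case-insensitively is an assumption; the text does not say whether the original check ignored case.

```python
def looks_like_html(body: bytes) -> bool:
    """Guess whether a supposed robots.txt body is really a Web page.

    Mirrors the survey's heuristic: the body counts as HTML if it
    contains "<title" or "<html" anywhere. Case is ignored here,
    which is an assumption rather than a documented detail.
    """
    lowered = body.lower()
    return b"<title" in lowered or b"<html" in lowered
```

A mislabelled but otherwise ordinary robots.txt file (`User-agent: *` and so on) fails this test, which is how the two groups in the text/html responses can be told apart.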
Over half of the sites -- 51.7% -- didn't provide a robots.txt and responded to Nikita's query with a 404. This is perfectly valid and simply means that robots are permitted unrestricted access to the site. If we assume that the Web pages returned above should really return 404s, then the number of sites without a robots.txt file jumps to almost 60%.
Just 0.4% of the sites chose to use the part of MK1994 that says that 401 and 403 responses indicate that the site is off limits to all robots. And my guess is that some of these sites simply respond with a 401 or 403 for all files when the user agent looks like a robot's. In other words, this feature of the robots.txt spec is barely used.
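The response-code rules discussed above (404 means all robots are welcome, 401 and 403 mean the site is off limits) might be encoded roughly as follows. The `Access` enum and `access_policy` function are hypothetical names, and lumping every other code into an "undefined" bucket is an assumption made for this sketch; a real robot has to pick its own behaviour there.

```python
from enum import Enum


class Access(Enum):
    PARSE_BODY = "parse the returned rules"
    ALLOW_ALL = "no restrictions"
    DENY_ALL = "site is off limits to robots"
    UNDEFINED = "not covered by the rules discussed here"


def access_policy(status_code: int) -> Access:
    """Map the HTTP status of a robots.txt request to a crawl policy."""
    if status_code == 200:
        return Access.PARSE_BODY
    if status_code == 404:
        return Access.ALLOW_ALL
    if status_code in (401, 403):
        return Access.DENY_ALL
    # Everything else (202, 300, redirect loops, 550, ...) is left
    # for the robot's author to decide.
    return Access.UNDEFINED
```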
Five sites (all of them at aol.com) return a 202 (accepted) response code, which makes no sense at all in this context. A few others return 302 redirects that eventually redirect back to themselves in an infinite loop. One returns a 300 (multiple choices) which is meaningless to a robot, and another returns a 550 response code for which I can't even find a definition. While these sites would be interesting for an article entitled “101 Ways to Misconfigure Your Web Server”, they're a vanishingly small portion of this sample and don't merit further attention here.
Robots.txt files containing non-ASCII are of particular interest to me because Nikita ran into a problem where Python’s robot exclusion rules parser crashes when confronted with non-ASCII under some circumstances. I wrote a new robots.txt parser to handle a wider range of robots.txt files.
Introducing the subject of non-ASCII also introduces the subject of encodings, on which MK1994 and MK1996 are silent. Fortunately, HTTP once again comes to the rescue. The HTTP specification says that text media has a default encoding of ISO-8859-1 (a superset of US-ASCII), so robots.txt files can legally contain ISO-8859-1 characters even if no encoding is specified via HTTP.
All but a tiny handful of the robots.txt files in the sample contain pure ASCII (handful being a scientific term defined as 0.2%). The fifty-five that don't can be divided into three categories. The first category is files that contain non-ASCII in the comments (e.g. “Det er ikke tillat med roboter, spidere og fremmede script på disse områdene”). Since a properly programmed spider ignores the comments, this category isn't too interesting.
The second category has non-ASCII in meaningful robots.txt fields. Oddly enough, almost all of these are a consequence of a robot called Hämähäkki (the Finnish word for spider) which appeared in a list of active robots in the mid-90s. The spider itself is gone, but a decade later, the name Hämähäkki lives on in robots.txt files. Apparently one or more automated tools built robots.txt files that listed all "known" spiders based on an outdated list at robotstxt.org. Of the files in my sample containing non-ASCII, Hämähäkki was the only non-ASCII in thirty-two (over half) of them. Robots that aren't prepared to handle non-ASCII might have trouble with these. Posthumous kudos to Hämähäkki for keeping us on our toes.
The third category of non-ASCII robots.txt files consists of those that contain a BOM (byte order mark) -- I found just twelve of these. This is another area where Python's robots.txt parser can get confused, and I suspect that it is not the only code library to have this weakness.
The problem is that parsers that fail to account for the BOM see it as part of the first line of text. If that line is a comment (which it often is) then the BOM won't cause any problems. But if, for instance, the file consists of a UTF-8 BOM followed by a simple disallow-all rule, then some parsers might see this (the BOM is the first three characters of the first line, shown as they appear when its bytes are read as ISO-8859-1):
    ï»¿User-agent: *
    Disallow: /
To a parser, the user-agent line might just look like garbage, and as a result the Disallow line after it would be ignored. So the parser would see an “empty” robots.txt file and permit access to the entire site, which is exactly the opposite of what the author intended.
At minimum, robots.txt parsers should not allow BOMs to interfere with proper parsing of the file. Ideally, the parser would use the BOM as it was intended: to indicate the encoding of the file. Given that non-ASCII robots.txt files are so rare, I expect that support for them among code libraries is weak and support for BOMs is probably weaker still. Any Webmaster who codes a robots.txt file that contains non-ASCII and relies on proper interpretation of the BOM to decode it is asking for trouble. (Not to mention the fact that doing so violates the HTTP specification which says in section 3.7.1, “Data in character sets other than ‘ISO-8859-1’ or its subsets MUST be labeled with an appropriate charset value.”)
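One way a parser could follow the advice above is to look for a BOM before doing anything else, use it to pick the encoding, and otherwise fall back to the charset sent via HTTP or the ISO-8859-1 default. This is only a sketch: `decode_robots_txt` and its `http_charset` parameter are hypothetical names, and letting the BOM take precedence over the HTTP charset is a design choice made here, not a rule from any specification.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32-LE BOM
# begins with the same two bytes as the UTF-16-LE BOM.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def decode_robots_txt(raw: bytes, http_charset=None) -> str:
    """Decode a robots.txt body, stripping any BOM so that it never
    ends up glued to the first field name."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(encoding, errors="replace")
    try:
        return raw.decode(http_charset or "iso-8859-1", errors="replace")
    except LookupError:
        # Unknown charset label: fall back to the HTTP default.
        return raw.decode("iso-8859-1", errors="replace")
```

Fed a UTF-8 BOM followed by the disallow-all rule, this returns text whose first line is a clean `User-agent: *`, so the Disallow line after it is no longer ignored.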
It is worth noting that under Windows 2000 (and probably Windows XP), Notepad adds a BOM if you save a text file as UTF-8 or Unicode. You can see this for yourself by using a hex viewer for Windows or using hexdump under Unix.
Google's robots.txt parser supports wildcards in pathnames. It seems likely that other bots support this too, but Google is the only one for which I have a reference.
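As an illustration of what wildcard support involves, here is a sketch that assumes the syntax Google documents: `*` matches any run of characters and a trailing `$` anchors the end of the path, with rules otherwise matching as prefixes. The function name `rule_matches` is hypothetical, and other robots may interpret wildcards differently or not at all.

```python
import re


def rule_matches(path_pattern: str, url_path: str) -> bool:
    """Test a URL path against a robots.txt path rule with wildcards.

    Assumed syntax: "*" matches any sequence of characters and a
    trailing "$" means the rule must match through the end of the path.
    Without "$" the rule is a prefix match, as in plain robots.txt.
    """
    anchored = path_pattern.endswith("$")
    if anchored:
        path_pattern = path_pattern[:-1]
    # Escape everything literally, then turn each "*" into ".*".
    body = ".*".join(re.escape(part) for part in path_pattern.split("*"))
    if anchored:
        body += r"\Z"
    # re.match anchors at the start of the path only.
    return re.match(body, url_path) is not None
```

Under these assumptions, `/*.pdf$` matches `/docs/report.pdf` but not `/docs/report.pdf?x=1`, while a rule with no wildcards such as `/tmp` still matches `/tmpfile` by prefix.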
A number of robots (among them Yahoo Slurp, Inktomi and MSNBot) support a Crawl-Delay: n specification, where n is the number of seconds that a bot should wait between requesting pages. In my sample, 1.4% of robots.txt files contained a Crawl-Delay specification.
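Pulling Crawl-Delay values out of a file is straightforward; the sketch below shows one approach. The function name `crawl_delays` is hypothetical, and the choices to match the field name case-insensitively, to accept fractional seconds, and to silently skip malformed values are assumptions rather than documented behaviour of any particular robot.

```python
import math


def crawl_delays(robots_txt: str) -> list:
    """Return every Crawl-Delay value (in seconds) found in the text."""
    delays = []
    for line in robots_txt.splitlines():
        # Drop comments, then split "field: value".
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "crawl-delay":
            continue
        try:
            seconds = float(value.strip())
        except ValueError:
            continue  # e.g. "Crawl-Delay: soon"
        if math.isfinite(seconds) and seconds >= 0:
            delays.append(seconds)
    return delays
```

The function returns a list rather than a single number because one file can specify a different delay for each user agent.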
I was curious about what crawl delays Webmasters choose, and I found that the numbers vary widely. In the 360 files that contained crawl delays, there were 514 delays specified. The minimum delay was 1 second, the maximum 172800 (which is 48 hours), the mean 890.34, the median 10 and the mode 1. Since this is such a broad range of data with some big numbers that skew the average, it's helpful to look at the percentage of crawl delays less than or equal to a given value. The table below shows exactly that. For instance, 62% of all of the crawl delays were ≤ 15 seconds.
| Delay (seconds) | % of delays ≤ |
|---|---|
| 1 | 25% |
| 2 | 30% |
| 3 | 32% |
| 5 | 39% |
| 10 | 59% |
| 15 | 62% |
| 20 | 70% |
| 30 | 80% |
| 60 | 87% |
| 120 | 95% |
| 900 | 99% |
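The summary figures and the cumulative percentages in the table above are simple to compute. The sketch below uses a small made-up list of delays, not the survey data, to show why the mean is a poor summary when a few huge values are present; `share_at_or_below` is a hypothetical helper name.

```python
from statistics import mean, median, mode


def share_at_or_below(delays, threshold):
    """Percentage of delays that are <= threshold."""
    return 100 * sum(1 for d in delays if d <= threshold) / len(delays)


# Illustrative values only; these are NOT the 514 delays from the survey.
sample = [1, 1, 1, 5, 10, 10, 20, 30, 60, 172800]

print("mean  ", mean(sample))    # dominated by the single 48-hour delay
print("median", median(sample))
print("mode  ", mode(sample))
print("<= 15 ", share_at_or_below(sample, 15), "%")
```

Even in this tiny example one 172800-second delay drags the mean into the thousands while the median stays at 10, which is the same skew visible in the real data.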
The table below summarizes the frequency of the items discussed above. The figures given are from my sample of robots.txt files from 25,060 different hosts. “Different”, in this case, was determined by a case-insensitive string comparison of the non-path portion of the URL. For example, news.example.com was considered different from www.example.com.
| Feature | Occurrences | Percentage |
|---|---|---|
| Expires header present | 689 | 2.7% |
| Return text/html | 1937 | 7.7% |
| Return 404 | 12958 | 51.7% |
| Return 401 or 403 | 116 | 0.4% |
| Contain non-ASCII | 55 | 0.2% |
| Contain a BOM | 12 | < 0.1% |
| Specify Crawl-Delay | 360 | 1.4% |
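The rule for counting “different” hosts (a case-insensitive comparison of the non-path portion of the URL) could be implemented along these lines. The helper name `host_key` is hypothetical, and including the scheme and any port in the key is an assumption about what “non-path portion” covers.

```python
from urllib.parse import urlsplit


def host_key(url: str) -> str:
    """Return a lowercase key built from everything before the path."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}".lower()


urls = [
    "http://WWW.Example.com/robots.txt",
    "http://www.example.com/about/",
    "http://news.example.com/robots.txt",
]
# Two distinct hosts: www.example.com and news.example.com.
distinct_hosts = {host_key(u) for u in urls}
```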
I sampled these robots.txt files as part of pre-alpha testing of Nikita the Spider. The sample includes the sites I spidered and the sites that they link to (because Nikita fetches robots.txt before checking a link). Whether or not there was a bias in my sample, I cannot say. Actually, I can't think of a way of building a sample that doesn't contain at least some bias. I hope that 25,000 files is a sample large enough to smooth out the inherent bias and from which to draw conclusions.
Speaking of conclusions, I have two contradictory ones. First, an observation: robots.txt hasn't changed much from its 1994 origins. 99.8% of the robots.txt files present on the Net today are pure ASCII, few extensions have been made to the original format and some of the elements of the original format (such as the use of 401 and 403 response codes) are barely used. Based on this, the fact that 50–60% of sites don't even bother with a robots.txt file, and on the number of bungled robots.txt files out there, one could conclude that the original specification wasn't very good. But I prefer the opposite conclusion: the original specification was a good one and still gets the job done. And with a little more respect for the existing “features” implied by HTTP (the Expires header and encoding specification) and the widespread acceptance of crawl-delay (which seems quite useful), the format might survive another ten years without further alteration.
Thanks for reading! Comments are welcome.
If you like this article, you can share it without fear of DMCA goons kicking down your door in the middle of the night. It is copyright Philip Semanchuk under a non-commercial, share-alike Creative Commons License.