Download robotexclusionrulesparser.py
You can skip all of my blah blah blah and go to the usage notes.
Python's
robotparser
module does a fine job with 99.8% of the robots.txt files out there. But
I ran into some that it didn't handle and the
problem (non-ASCII characters in the robots.txt)
was one that I felt I could not easily hack the module to accommodate. In response, I
put this module together in the spring of 2006. It is not a drop-in replacement
for robotparser (some of the function names are different) but it works in much
the same way.
"Does it comply with the spec?" is a trick question in this case; there is no real spec. The most recent formal robots.txt format proposal was published in 1996 (urk!) and that was only a draft which was never sanctified. (It even says clearly at the top, "It is inappropriate to use Internet-Drafts as reference material...") Even specs that have gone through a full review and comment process can be open to interpretation, so it's no surprise that the robots.txt draft spec has some holes. Actually, it is surprisingly complete, considering.
For purposes of this discussion, I'll refer to Martijn Koster's 1994 document as MK1994 and his 1996 draft spec as MK1996.
Side note: For those inclined to sneer at the idea of treating an unblessed draft
(MK1996) as a spec, keep in mind that MK1994 "represents a consensus [ed: of six people]
on 30 June 1994 on the
robots mailing list", which sounds much less official than an IETF draft document.
Interestingly enough, the 1994 robots.txt mailing list
discussion is still online and there's even Python source code for a
robots.txt parser posted on 21 Jun 1994 by some guy from the Netherlands. I wonder where
that guy is today. Probably toiling away in obscurity. ;-)
This code is intended to interpret all robots.txt files written according to MK1994/96. In particular, this module supports the following lesser-known parts of MK1994/96:
The vast majority of robots.txt tutorials and the like make no mention of the features
introduced in MK1996 (like Allow: fields). Furthermore, many insist that end-of-line markers
must be Unix-style \n even though it is clearly stated in MK1994 and MK1996 that
\r, \n and \r\n are all acceptable. Even such luminaries
as Wikipedia,
Microsoft and
the W3C
seem unaware of MK1996, although they are willing
to quote the older and less formal MK1994. Hrrmph.
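To make the end-of-line point concrete, here is a minimal sketch (not taken from this module's source) that splits robots.txt content on any of the three markers. The name split_robots_lines is made up for illustration.

```python
import re

# Try \r\n first so a Windows line ending is not counted as two breaks.
_EOL = re.compile(r"\r\n|\r|\n")


def split_robots_lines(content):
    """Split robots.txt text on \\r\\n, \\r or \\n (hypothetical helper)."""
    return _EOL.split(content)


lines = split_robots_lines("User-agent: *\r\nDisallow: /tmp\rAllow: /")
```

A regular expression is used here rather than str.splitlines() because the latter also breaks on several other characters (form feed, vertical tab and so on) that MK1994/96 don't treat as line endings.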
In the spirit of "be generous in what you accept", this module also handles some things that are invalid according to the specs.
The most significant exception this module permits is the presence of non-ASCII
characters in the significant fields of robots.txt. ("Significant"
here means "anything outside of a comment".) MK1994 doesn't address the subject of
non-ASCII or encodings, but MK1996 (in Section 3.3, "Formal Syntax")
makes it clear that only ASCII characters are allowed in significant
fields. (Side note – actually only a subset of printable ASCII characters are allowed;
but you'll have to read the spec yourself to get the gory details.)
However, a survey of real-world robots.txt files
shows that about one in every thousand includes non-ASCII in significant fields. Python's
robotparser module rolls over and dies when it encounters these. In
contrast, this module
attempts to decode the file using (a) the encoding specified in the HTTP Content-Type header
sent with robots.txt (if present, which is rare) or (b) a default of ISO-8859-1 as per the
HTTP spec RFC 2616. This solves the encoding problems for nearly all non-ASCII robots.txt files.
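The decoding strategy described above can be sketched roughly as follows. This is an illustration written for Python 3, not the module's actual code; decode_robots_txt and DEFAULT_ENCODING are hypothetical names, and the standard library's email.message.Message is used here only as a convenient way to pull the charset parameter out of a Content-Type value.

```python
from email.message import Message

DEFAULT_ENCODING = "iso-8859-1"  # the HTTP default per RFC 2616


def decode_robots_txt(raw, content_type=None):
    """Decode robots.txt bytes to text (hypothetical helper).

    Uses the charset from the Content-Type header when one is given,
    otherwise falls back to ISO-8859-1.
    """
    charset = None
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()  # None if there is no charset
    # Can raise UnicodeError (bad bytes) or LookupError (unknown encoding).
    return raw.decode(charset or DEFAULT_ENCODING)
```

Note that an encoding name Python doesn't recognize still raises an exception here, so a caller has to be ready for that.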
Also, RobotExclusionRulesParser accepts "user-agent" or "useragent" in robots.txt
whereas MK1994/96 only permit the former.
MK1996 contradicts MK1994 somewhat. MK1994 (which only defines
disallows) says that a blank path indicates nothing is disallowed. MK1996 (which
defines both allows and disallows) doesn't permit blank paths (the minimal path is a
single slash: /) but
doesn't mention anything about this change in the section on backwards compatibility. AFAIK
blank paths are still widely used in Disallow lines which is consistent with the fact that
most of the Net seems to ignore MK1996 and regard MK1994 as the de facto standard.
In the absence of other guidance, this code interprets blank disallow lines as meaning nothing is disallowed and for consistency interprets blank allow lines as meaning nothing is allowed. Thus, these two rules mean the same thing:
# Disallow everything
User-agent: foobot
Disallow: /
# Allow nothing
User-agent: foobot
Allow:
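One way to express that interpretation in code is to rewrite blank paths into their explicit equivalents before matching. The sketch below is only an illustration of the rule as stated; normalize_rule is a hypothetical name and this is not necessarily how the module does it internally.

```python
def normalize_rule(field, path):
    """Return (field, path) with blank paths made explicit (hypothetical).

    A blank Disallow disallows nothing, which is the same as "Allow: /".
    A blank Allow allows nothing, which is the same as "Disallow: /".
    """
    field = field.strip().lower()
    path = path.strip()
    if field not in ("allow", "disallow"):
        raise ValueError("unexpected field: %r" % field)
    if not path:
        opposite = "allow" if field == "disallow" else "disallow"
        return (opposite, "/")
    return (field, path)
```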
Yahoo Slurp, Inktomi, MSNBot and perhaps some other bots support a Crawl-Delay: n specification, where n is the number of seconds that a bot should wait between requesting pages. This module doesn't support that extension to MK1994/96.
Also, this module doesn't support any of Google's extensions to the robots.txt standards.
First of all, thanks to Bastian Kleineidam for writing Python's
robotparser module.
Parts of this module were inspired (directly and indirectly) by his work. Here's a list
of differences between robotparser and my robotexclusionrulesparser
in rough order of significance.
is_expired() makes use of this date; see the
usage notes for more information. Consequently, this module dispenses with the
modified() and mtime() functions provided by
robotparser. However, I deliberately left the instance variable
expiration_date easily accessible in case you want to mess with it.
robotparser. MK1996 is mostly non-committal on this topic, so
handling of these codes is somewhat implementation dependent. IMHO,
robotparser complies with the letter but not the spirit of MK1996 with
regard to handling error codes. MK1996 says, "On the request attempt resulted in
temporary failure [sic] a robot should defer visits to the site until such
time as the resource can be retrieved". It also says that that "is not required...[but
is] recommended". robotparser handles those errors internally (e.g. a 503
Service Unavailable is interpreted as "allow all");
this module punts such errors up to the caller so that she can
decide how to handle them.
robotparser's handling of robots.txt files that
contain a BOM (byte order mark). It doesn't make any accommodation for them, so
it might see the first line of a robots.txt file with a UTF-8 BOM as this:
ï»¿User-agent: foobot
The bug can have significant consequences when robots.txt consists of this:
[BOM]User-agent: *
Disallow: /
The user-agent line will be seen as garbage and so the disallow rule will be
ignored. The result will be that all robots will be permitted everywhere which is
the exact opposite of what the robots.txt author intended.
Robotexclusionrulesparser doesn't get confused by BOMs; it simply
ignores them.
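As a rough illustration of what ignoring a BOM can look like, the sketch below drops a leading UTF-8 BOM from the raw bytes before any parsing happens. It is an assumption about one reasonable approach, not an excerpt from this module; strip_utf8_bom is a hypothetical name.

```python
import codecs


def strip_utf8_bom(raw):
    """Drop a leading UTF-8 byte order mark, if any (hypothetical helper)."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


# A robots.txt body that starts with a BOM, as in the example above.
raw = codecs.BOM_UTF8 + b"User-agent: *\nDisallow: /\n"
first_line = strip_utf8_bom(raw).split(b"\n")[0]
```

Without the stripping step, the three BOM bytes stay glued to the front of the first field name, which is why a naive parser fails to recognize the user-agent line.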
parse() function accepts a
string; that string can be Unicode. If it isn't Unicode, it's decoded to Unicode
using ISO-8859-1.
response_code attribute that
reports (what else?) the response code when a robots.txt file is fetched from a
remote server.
robotparser's handling of paths that contain a %-encoded
forward slash; MK1996 says that they shouldn't be translated but the robotparser module
does. This module replaces that bug with newer, more interesting bugs. =)
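For the %-encoded slash point, here is a minimal Python 3 sketch of decoding a path while leaving an encoded forward slash alone. The name decode_path_keep_slash is hypothetical and the approach (split on the encoded slash, decode the pieces, rejoin) is just one way to do it, not necessarily this module's.

```python
import re
from urllib.parse import unquote

_ENCODED_SLASH = re.compile(r"%2f", re.IGNORECASE)


def decode_path_keep_slash(path):
    """Percent-decode a path but leave %2F encoded (hypothetical helper)."""
    pieces = _ENCODED_SLASH.split(path)
    # Decode everything between encoded slashes, then put them back.
    return "%2F".join(unquote(piece) for piece in pieces)
```

With this, /a%2Fb and /a/b stay distinct when compared against a rule, while other escapes such as %7E are still decoded.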
The RobotExclusionRulesParser class exposes four functions and five attributes.
General usage is to call fetch() to set up the parser and then call
is_allowed(). If your code is long-running, you'll also want to call
is_expired() occasionally. Everything else is non-essential.
fetch(): Fetch robots.txt from the URL provided. This function also sets the
expiration_date attribute.
parse(): Parse the content of a robots.txt file. This is useful if your robots.txt file isn't HTTP-accessible, or if you just want to experiment. The unit tests make heavy use of this function.
is_allowed(): Return a boolean indicating whether or not the given user agent is allowed to visit
the URL. The user agents listed in robots.txt only need be present as a substring in
the UserAgent parameter for this function to match them; the comparison is
case-insensitive. e.g. passing a UserAgent of Mozilla/5.0 (compatible;
Foobot/2.1) would match the user agent rule foobot.
The scheme and authority are discarded from the URL when comparing it to robots.txt
rules. (e.g. http://www.example.com/foo/bar.html becomes
/foo/bar.html.) This is the way you want it to work -- the rules in
robots.txt don't specify scheme and authority themselves, so one can't match against
them.
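The two matching rules just described (case-insensitive substring matching on the user agent, and dropping scheme and authority from the URL) can be sketched like this in Python 3. Both function names are hypothetical, and keeping the query string attached to the path is an assumption of this sketch rather than something stated above.

```python
from urllib.parse import urlsplit


def agent_matches(rule_agent, user_agent):
    """True if the robots.txt agent name occurs in the caller's UA string."""
    return rule_agent.lower() in user_agent.lower()


def path_only(url):
    """Discard scheme and authority, keeping the path (and any query)."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return path
```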
is_expired(): Return a boolean indicating whether or not the parser has passed its expiration
date (the dreaded "not-so-fresh" feeling). The expiration date is set when you
call fetch() either by reading the HTTP Expires header or by using
a default of seven days. See also the related attributes expiration_date and
use_local_time.
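The expiration logic can be pictured with the following Python 3 sketch, which works purely in Unix timestamps. It is a simplified illustration under my own assumptions (it ignores the local time versus UTC switch, and the names compute_expiration and is_expired_at are made up), not the module's implementation.

```python
import time
from email.utils import parsedate_to_datetime

SEVEN_DAYS = 7 * 24 * 60 * 60  # default lifetime, in seconds


def compute_expiration(expires_header, now):
    """Return a Unix timestamp from an Expires header, else now + 7 days."""
    if expires_header:
        try:
            return parsedate_to_datetime(expires_header).timestamp()
        except (TypeError, ValueError):
            pass  # unparseable date: fall through to the default
    return now + SEVEN_DAYS


def is_expired_at(expiration, now=None):
    """Compare an expiration timestamp to "now" (hypothetical helper)."""
    if now is None:
        now = time.time()
    return expiration <= now
```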
user_agent: Send this user agent string to the server when fetching robots.txt. If left blank, Python's default is used.
source_url: This read-only attribute reports the URL that you used in the most recent call
to fetch(). This is useful when the
parser's expiration date passes because you can just call
parser.fetch(parser.source_url) to refresh the parser.
use_local_time: A boolean that tells RobotExclusionRulesParser whether
expiration_date should be in local time or UTC (a.k.a. Greenwich
Mean Time).
Since expiration_date is set when you call fetch(), you
must set use_local_time before calling
fetch() for it to have any effect.
If you only call is_expired() and never look at
expiration_date, you can leave use_local_time at its
default (True).
expiration_date: A timestamp that states when the robots.txt contents are out of date. The timestamp
is a Unix-style timestamp; i.e. a float counting the number of seconds since the
epoch. The function is_expired() will compare this to "now" for you.
response_code: The response code received during the last fetch from a remote server, or None
if fetch has not been called. When using Python ≤ 2.3, this information is
less precise. It is set to 200 if the fetch is successful or None otherwise. (Older
versions of Python don't provide this information and so I have to fake it.)
This attribute is read-only.
Users of this module should be aware that it raises a few exceptions. Some of them are impossible to finesse internally (Unicode errors, for instance). Others are deliberately exposed because handling them is outside of the scope of the robots.txt specifications and thus outside of the scope of this module (for instance, when a robots.txt is present but an error occurs during its transmission).
The first exception explicitly raised by this code is a Unicode exception
(some flavor of UnicodeError). You
can see it in two situations. The first is when the parser fetches a robots.txt file that
can't be
decoded using the encoding specified in the HTTP response header. (That encoding defaults to
ISO-8859-1 which is a superset of US-ASCII which is what > 99.9% of existing robots.txt
files use.) Second, you'll see a Unicode exception
if you feed a non-Unicode string (i.e. isinstance(YourString, unicode) ==
False) to parse() and that string can't be decoded using ISO-8859-1.
The second exception explicitly raised by this code is a urllib2.URLError
exception. fetch()
uses urllib2.urlopen() and if that function raises an exception, the
exception is passed up to the caller after being massaged to make it a little nicer to deal
with.
Note that not all non-200 response codes raise an exception. Those for which MK1996 defines specific actions are handled internally –
If the RobotExclusionRulesParser raises a URLError exception that the caller
decides isn't fatal (e.g. the response code 410 Gone), she can just call
parser.parse("") and use the parser as normal.
Note that although urllib2 handles most redirects by itself, urllib2 can return 301/302 as the response code if the server generates an infinite loop of 301/302 redirects. Users of this module should be prepared to handle that response code.
These aren't the only exceptions that you might see; they're just the ones that
the code raises explicitly. Another likely source for exceptions is the
unicode() function. The function fetch() gets the encoding from
the Content-Type header that comes with the robots.txt file. That encoding gets passed directly
to unicode(). When I first started using unicode() I naïvely
expected that an encoding Python didn't understand would be bounced back as a LookupError.
Apparently I wasn't alone in thinking
that; Python
bug 960874 was filed for that reason. But it was closed as no bug/no fix with the
explanation that "it is not guaranteed that you will only see LookupErrors (the same is
true for most other Python APIs, e.g. most can generate MemoryErrors). Possible other errors
are ValueErrors, NameErrors, ImportErrors, etc. etc." Explanations like this make me
long for a construct like Java's throws keyword. :-/
You also need to watch out for a variety of exceptions from fetch() because it
calls urllib2 which calls httplib which calls socket. I repackage some of the exceptions
but others get raised up to the caller untouched. Experience has taught me to watch out
for
httplib.BadStatusLine, socket.error and socket.timeout.
You also need to handle Python bug 900744
which causes httplib to raise ValueError in some cases. This affects
Python 2.4 but not 2.3. AFAIK there is no workaround other than to apply the patch supplied.
Last but not least, there's a bug in
urllib2 that raises OSError on rare occasions. This has been fixed but
as of Python 2.5.1 the patch has not yet been integrated.
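Pulling the exceptions mentioned in this section together, a caller might wrap the fetch in something like the sketch below. It is written for Python 3, where urllib2 and httplib have become urllib.error and http.client, so the exception names are the modern equivalents of the ones named above. FETCH_ERRORS and try_fetch are hypothetical names, and the wrapper takes any callable so that it doesn't assume anything about this module's internals.

```python
import http.client
import urllib.error

# Exceptions a caller may want to treat as "fetch failed".
FETCH_ERRORS = (
    urllib.error.URLError,      # urllib2.URLError in Python 2
    http.client.BadStatusLine,  # httplib.BadStatusLine in Python 2
    OSError,                    # covers socket.error and socket.timeout
    ValueError,                 # covers UnicodeError and the httplib bug
    LookupError,                # an encoding name Python doesn't know
)


def try_fetch(fetch, url):
    """Call fetch(url); return None on success or the caught exception."""
    try:
        fetch(url)
    except FETCH_ERRORS as exc:
        return exc
    return None
```

Returning the exception rather than swallowing it leaves the decision about what is fatal to the caller, which matches the approach this module takes.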
For more examples, see parser_test.py.
import robotexclusionrulesparser

rerp = robotexclusionrulesparser.RobotExclusionRulesParser()

# I'll set the (optional) UserAgent before calling fetch.
rerp.user_agent = "Foobot/2.1 (See http://example.net/foobot.html for info)"

# Note that there should be a try/except here to handle urllib2.URLError,
# socket.timeout, UnicodeError, etc.
rerp.fetch("http://www.example.org/robots.txt")

# ListOfUserAgentAndUrlPairs is a list of (user agent, URL) tuples that you supply.
for UserAgent, url in ListOfUserAgentAndUrlPairs:
    print("Can %s fetch %s? %s" %
          (UserAgent, url, rerp.is_allowed(UserAgent, url)))
response_code attribute.
asctime() would have their expiration date
calculated incorrectly. Since only 3% of robots.txt files come with an
Expires header and use of the old date format is rare, this bug probably
affected very few people.
fetch() to protect against Webmasters
who supply an ISO or some such instead of a robots.txt. (It happens...)
unicode
to the top of is_allowed() to make it easier for callers to debug Unicode
problems should they arise.
_ParseContentTypeHeader()
to accept quoted values in the parameter field of HTTP content type headers so that an encoding
like charset="utf-8" is now interpreted properly.
source_url read-only. This isn't a
functional change, it just enforces what was formerly just conceptual.
_ParseContentTypeHeader() more robust when
dealing with malformed Content-Type headers.
unicode().
__str__() to return a UTF-8 encoded string so that it won't
fail when the robots.txt content is non-ASCII.
.lower() in _ParseContentTypeHeader().
This code is copyright Philip Semanchuk under the GNU General Public License.
Comments, bug reports, etc. are most welcome.