Nikita the Spider

A Robot Exclusion Rules Parser for Python

Download robotexclusionrulesparser.py

Download parser_test.py

You can skip all of my blah blah blah and go to the usage notes.

Why?

Python's robotparser module does a fine job with 99.8% of the robots.txt files out there. But I ran into some that it didn't handle, and the problem (non-ASCII characters in the robots.txt) was one that I felt I could not easily hack the module to accommodate. In response, I put this module together in the spring of 2006. It is not a drop-in replacement for robotparser (some of the function names are different) but it works in much the same way.

Compliance with Published Specifications

"Does it comply with the spec?" is a trick question in this case; there is no real spec. The most recent formal robots.txt format proposal was published in 1996 (urk!) and that was only a draft which was never sanctified. (It even says clearly at the top, "It is inappropriate to use Internet-Drafts as reference material...") Even specs that have gone through a full review and comment process can be open to interpretation, so it's no surprise that the robots.txt draft spec has some holes. Actually, it is surprisingly complete, considering.

For purposes of this discussion, I'll refer to Martijn Koster's 1994 document as MK1994 and his 1996 draft spec as MK1996.

Side note: For those inclined to sneer at the idea of treating an unblessed draft (MK1996) as a spec, keep in mind that MK1994 "represents a consensus [ed: of six people] on 30 June 1994 on the robots mailing list", which sounds much less official than an IETF draft document. Interestingly enough, the 1994 robots.txt mailing list discussion is still online and there's even Python source code for a robots.txt parser posted on 21 Jun 1994 by some guy from the Netherlands. I wonder where that guy is today. Probably toiling away in obscurity.  ;-)

This code is intended to interpret all robots.txt files written according to MK1994/96. In particular, this module supports the following lesser-known parts of MK1994/96:

The vast majority of robots.txt tutorials and the like make no mention of the features introduced in MK1996 (like Allow: fields). Furthermore, many insist that end-of-line markers must be Unix-style \n even though it is clearly stated in MK1994 and MK1996 that \r, \n and \r\n are all acceptable. Even such luminaries as Wikipedia, Microsoft and the W3C seem unaware of MK1996, although they are willing to quote the older and less formal MK1994. Hrrmph.
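
As a rough illustration of the end-of-line point (a sketch, not this module's actual code), a parser can accept all three markers by splitting on one regular expression. The name split_lines is made up for this example.

```python
import re

# "\r\n" has to come first in the alternation so that it is consumed as a
# single marker instead of as "\r" followed by an empty line and "\n".
_EOL = re.compile(r"\r\n|\r|\n")


def split_lines(content):
    """Split robots.txt content on CRLF, CR or LF line endings."""
    return _EOL.split(content)
```

Python's own str.splitlines() would also work here, although it treats a few extra characters (form feed, for instance) as line boundaries.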

Extensions to Published Specifications

In the spirit of "be generous in what you accept", this module also handles some things that are invalid according to the specs.

The most significant exception this module permits is the presence of non-ASCII characters in the significant fields of robots.txt. ("Significant" here means "anything outside of a comment".) MK1994 doesn't address the subject of non-ASCII or encodings, but MK1996 (in Section 3.3, "Formal Syntax") makes it clear that only ASCII characters are allowed in significant fields. (Side note – actually only a subset of printable ASCII characters are allowed; but you'll have to read the spec yourself to get the gory details.) However, a survey of real-world robots.txt files shows that about one in every thousand includes non-ASCII in significant fields. Python's robotparser module rolls over and dies when it encounters these. In contrast, this module attempts to decode the file using (a) the encoding specified in the HTTP Content-Type header sent with robots.txt (if present, which is rare) or (b) a default of ISO-8859-1 as per the HTTP spec RFC 2616. This solves the encoding problems for nearly all non-ASCII robots.txt files.
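
The decoding rule described above can be sketched roughly as follows. This is written for Python 3 and is only an illustration under my own assumptions, not the module's implementation; the function names charset_from_content_type and decode_robots_txt are hypothetical.

```python
from email.message import Message

# Default charset for text/* media types according to RFC 2616.
DEFAULT_ENCODING = "iso-8859-1"


def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def decode_robots_txt(raw_bytes, content_type=None):
    """Decode robots.txt bytes using the declared charset, else ISO-8859-1."""
    encoding = charset_from_content_type(content_type) or DEFAULT_ENCODING
    return raw_bytes.decode(encoding)
```

Note that a declared charset that doesn't fit the bytes still raises UnicodeError here, which matches the behavior described in the section on exceptions.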

Also, RobotExclusionRulesParser accepts "user-agent" or "useragent" in robots.txt whereas MK1994/96 only permit the former.
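
One way to accept both spellings is to normalize the field name while splitting each line. The sketch below is an assumption about how that could be done, not a copy of the module's code; split_field is a hypothetical name, and the comment stripping and case-insensitive field names follow MK1994.

```python
import re

# Matches "user-agent" and the non-standard "useragent", in any letter case.
_USER_AGENT_FIELD = re.compile(r"^user-?agent$", re.IGNORECASE)


def split_field(line):
    """Return (field_name, value) for a robots.txt line, or None if it has no field."""
    line = line.split("#", 1)[0].strip()  # drop any trailing comment
    if ":" not in line:
        return None
    name, value = line.split(":", 1)
    name = name.strip().lower()
    if _USER_AGENT_FIELD.match(name):
        name = "user-agent"  # fold both spellings into the official one
    return name, value.strip()
```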

Non-compliance with Published Specifications

MK1996 contradicts MK1994 somewhat. MK1994 (which only defines disallows) says that a blank path indicates nothing is disallowed. MK1996 (which defines both allows and disallows) doesn't permit blank paths (the minimal path is a single slash: /) but doesn't mention anything about this change in the section on backwards compatibility. AFAIK blank paths are still widely used in Disallow lines which is consistent with the fact that most of the Net seems to ignore MK1996 and regard MK1994 as the de facto standard.

In the absence of other guidance, this code interprets blank disallow lines as meaning nothing is disallowed and for consistency interprets blank allow lines as meaning nothing is allowed. Thus, these two rules mean the same thing:

        # Disallow everything
        User-agent: foobot
        Disallow: /

        # Allow nothing
        User-agent: foobot
        Allow:
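
The blank-path interpretation can be expressed as a small normalization step. This is a sketch under my own assumptions about representation (a rule as an (is_allowed, path) pair); normalize_rule is a hypothetical name and not part of the module's API.

```python
def normalize_rule(field, path):
    """Return (is_allowed, path) for an Allow or Disallow line.

    A blank Disallow means "nothing is disallowed" and a blank Allow means
    "nothing is allowed", so a blank path flips the rule and applies it to "/".
    """
    field = field.strip().lower()
    if field not in ("allow", "disallow"):
        raise ValueError("not an Allow or Disallow field: %r" % field)
    is_allowed = field == "allow"
    path = path.strip()
    if not path:
        return (not is_allowed, "/")
    return (is_allowed, path)
```

With this normalization, a blank Allow and "Disallow: /" produce the same pair, which mirrors the equivalence shown in the two rules above.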

Non-compliance with Unpublished Specifications

Yahoo Slurp, Inktomi, MSNBot and perhaps some other bots support a Crawl-Delay: n specification, where n is the number of seconds that a bot should wait between requesting pages. This module doesn't support that extension to MK1994/96.

Also, this module doesn't support any of Google's extensions to the robots.txt standards.

Differences Between This Module and Python's robotparser

First of all, thanks to Bastian Kleineidam for writing Python's robotparser module. Parts of this module were inspired (directly and indirectly) by his work. Here's a list of differences between robotparser and my robotexclusionrulesparser in rough order of significance.

  1. This module accepts non-ASCII characters in robots.txt. It decodes the file with the encoding specified in the HTTP Content-type header sent with robots.txt file. If no encoding is specified, it defaults to ISO-8859-1 per the HTTP specs. (Specifically, HTTP 1.0 section 3.6.1 and HTTP 1.1 section 3.7.1.)
  2. This module implements the "Expiration" section of MK1996. Specifically, it looks for an HTTP Expires header when fetching robots.txt. If it finds one, it stores that expiration date. Otherwise it uses the MK1996 default of one week. The function is_expired() makes use of this date; see the usage notes for more information. Consequently, this module dispenses with the modified() and mtime() functions provided by robotparser. However, I deliberately left the instance variable expiration_date easily accessible in case you want to mess with it.
  3. This module handles HTTP fetching errors differently than robotparser. MK1996 is mostly non-committal on this topic, so handling of these codes is somewhat implementation dependent. IMHO, robotparser complies with the letter but not the spirit of MK1996 with regard to handling error codes. MK1996 says, "On the request attempt resulted in temporary failure [sic] a robot should defer visits to the site until such time as the resource can be retrieved". It also says that this behavior "is not required...[but is] recommended". robotparser handles those errors internally (e.g. a 503 Service Unavailable is interpreted as "allow all"); this module punts such errors up to the caller so that she can decide how to handle them.
  4. There's a bug in robotparser's handling of robots.txt files that contain a BOM (byte order mark). It doesn't make any accommodation for them, so it might see the first line of a robots.txt file with a UTF-8 BOM as this:
    ï»¿User-agent: foobot

    The bug can have significant consequences when robots.txt consists of this:
    [BOM]User-agent: *
    Disallow: /

    The user-agent line will be seen as garbage and so the disallow rule will be ignored. The result will be that all robots will be permitted everywhere which is the exact opposite of what the robots.txt author intended.

    Robotexclusionrulesparser doesn't get confused by BOMs; it simply ignores them.

  5. This module adds a user_agent attribute that, if populated, is sent in lieu of Python's user agent when fetching robots.txt from the Web.
  6. This module's parse() function accepts a string; that string can be Unicode. If it isn't Unicode, it's decoded to Unicode using ISO-8859-1.
  7. This module adds a response_code attribute that reports (what else?) the response code when a robots.txt file is fetched from a remote server.
  8. This module accepts "user-agent" or "useragent" as being valid in robots.txt. The spec permits only the former.
  9. There's a bug in robotparser's handling of paths that contain a %-encoded forward slash; MK1996 says that they shouldn't be translated but the robotparser module does. This module replaces that bug with newer, more interesting bugs. =)
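
Two of the differences above (the BOM in item 4 and the %-encoded slash in item 9) lend themselves to short illustrations. The Python 3 sketch below shows one way to handle each; it is not taken from this module, and the names strip_utf8_bom and unquote_except_slash are hypothetical.

```python
import codecs
import re
from urllib.parse import unquote


def strip_utf8_bom(raw_bytes):
    """Drop a leading UTF-8 byte order mark, if there is one."""
    if raw_bytes.startswith(codecs.BOM_UTF8):
        return raw_bytes[len(codecs.BOM_UTF8):]
    return raw_bytes


# The capturing group makes re.split() keep each %2F in the result list.
_ENCODED_SLASH = re.compile(r"(%2[fF])")


def unquote_except_slash(path):
    """Decode %xx escapes in a path but leave %2F (an encoded "/") alone."""
    pieces = _ENCODED_SLASH.split(path)
    # Odd indexes hold the %2F separators; even indexes hold everything else.
    return "".join(piece if index % 2 else unquote(piece)
                   for index, piece in enumerate(pieces))
```

Without the first function, the three BOM bytes decode under ISO-8859-1 to the characters ï»¿, which then become part of the first field name.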

Usage

The RobotExclusionRulesParser class exposes four functions and five attributes. General usage is to call fetch() to set up the parser and then call is_allowed(). If your code is long-running, you'll also want to call is_expired() occasionally. Everything else is non-essential.

Functions

fetch(URL)

Fetch robots.txt from the URL provided. This function also sets the expiration_date attribute.

parse(content)

Parse the content of a robots.txt file. This is useful if your robots.txt file isn't HTTP-accessible, or if you just want to experiment. The unit tests make heavy use of this function.

is_allowed(UserAgent, URL)

Return a boolean indicating whether or not the given user agent is allowed to visit the URL. The user agents listed in robots.txt only need be present as a substring in the UserAgent parameter for this function to match them; the comparison is case-insensitive. For example, passing a UserAgent of Mozilla/5.0 (compatible; Foobot/2.1) would match the user agent rule foobot.

The scheme and authority are discarded from the URL when comparing it to robots.txt rules. (e.g. http://www.example.com/foo/bar.html becomes /foo/bar.html.) This is the way you want it to work -- the rules in robots.txt don't specify scheme and authority themselves, so one can't match against them.
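
The two matching rules described for is_allowed() can be sketched like this in Python 3. It is an illustration only, not the module's code: agent_matches and path_for_matching are hypothetical names, and keeping the query string while dropping the fragment is my assumption.

```python
from urllib.parse import urlsplit


def agent_matches(rule_agent, user_agent):
    """True if the robots.txt agent name occurs anywhere in the caller's agent string."""
    return rule_agent.lower() in user_agent.lower()


def path_for_matching(url):
    """Discard scheme and authority, keeping the part that rules are compared against."""
    parts = urlsplit(url)
    path = parts.path or "/"  # a bare host means the root path
    if parts.query:
        path += "?" + parts.query  # assumption: the query string is kept
    return path
```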

is_expired()

Return a boolean indicating whether or not the parser has passed its expiration date (the dreaded "not-so-fresh" feeling). The expiration date is set when you call fetch() either by reading the HTTP Expires header or by using a default of seven days. See also the related attributes expiration_date and use_local_time.
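
The expiration logic can be sketched as follows. This is a simplified Python 3 illustration rather than the module's code: it always works with UTC-based Unix timestamps (so it ignores use_local_time), the function names are hypothetical, and falling back to the default when the Expires header can't be parsed is my assumption.

```python
import time
from email.utils import mktime_tz, parsedate_tz

SEVEN_DAYS = 7 * 24 * 60 * 60  # the MK1996 default, in seconds


def expiration_timestamp(expires_header, now=None):
    """Return a Unix timestamp from an HTTP Expires value, or now + 7 days."""
    if now is None:
        now = time.time()
    if expires_header:
        parsed = parsedate_tz(expires_header)
        if parsed is not None:  # None means the header wasn't a valid date
            return float(mktime_tz(parsed))
    return now + SEVEN_DAYS


def is_expired(expiration, now=None):
    """True once the current time has passed the expiration timestamp."""
    if now is None:
        now = time.time()
    return now > expiration
```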

Attributes

user_agent

Send this user agent string to the server when fetching robots.txt. If left blank, Python's default is used.

source_url

This read-only attribute reports the URL that you used in the most recent call to fetch(). This is useful when the parser's expiration date passes because you can just call parser.fetch(parser.source_url) to refresh the parser.

use_local_time

A boolean that tells RobotExclusionRulesParser whether expiration_date should be in local time or UTC (a.k.a. Greenwich Mean Time). Since expiration_date is set when you call fetch(), you must set use_local_time before calling fetch() for it to have any effect.

If you only call is_expired() and never look at expiration_date, you can leave use_local_time at its default (True).

expiration_date

A timestamp that states when the robots.txt contents are out of date. The timestamp is a Unix-style timestamp; i.e. a float counting the number of seconds since the epoch. The function is_expired() will compare this to "now" for you.

response_code

The response code received during the last fetch from a remote server, or None if fetch has not been called. When using Python ≤ 2.3, this information is less precise. It is set to 200 if the fetch is successful or None otherwise. (Older versions of Python don't provide this information and so I have to fake it.)

This attribute is read-only.

Exceptions

Users of this module should be aware that it raises a few exceptions. Some of them are impossible to finesse internally (Unicode errors, for instance). Others are deliberately exposed because handling them is outside of the scope of the robots.txt specifications and thus outside of the scope of this module (for instance, when a robots.txt is present but an error occurs during its transmission).

The first exception explicitly raised by this code is a Unicode exception (some flavor of UnicodeError). You can see that in two different situations. First, you'll see one if the parser fetches a robots.txt file that can't be decoded using the encoding specified in the HTTP response header. (That encoding defaults to ISO-8859-1 which is a superset of US-ASCII which is what > 99.9% of existing robots.txt files use.) Second, you'll see a Unicode exception if you feed a non-Unicode string (i.e. isinstance(YourString, unicode) == False) to parse() and that string can't be decoded using ISO-8859-1.

The second exception explicitly raised by this code is a urllib2.URLError exception. fetch() uses urllib2.urlopen() and if that function raises an exception, the exception is passed up to the caller after being massaged to make it a little nicer to deal with.

Note that not all non-200 response codes raise an exception. Those for which MK1996 defines specific actions are handled internally –

If the RobotExclusionRulesParser raises a URLError exception that the caller decides isn't fatal (e.g. the response code 410 Gone), she can just call parser.parse("") and use the parser as normal.

Note that although urllib2 handles most redirects by itself, urllib2 can return 301/302 as the response code if the server generates an infinite loop of 301/302 redirects. Users of this module should be prepared to handle that response code.

These aren't the only exceptions that you might see, they're just the ones that the code raises explicitly. Another likely source for exceptions is the unicode() function. The function fetch() gets the encoding from the Content-Type header that comes with the robots.txt file. That encoding gets passed directly to unicode(). When I first started using unicode() I naïvely expected that an encoding Python didn't understand would be bounced back as a LookupError. Apparently I wasn't alone in thinking that; Python bug 960874 was filed for that reason. But it was closed as no bug/no fix with the explanation that "it is not guaranteed that you will only see LookupErrors (the same is true for most other Python APIs, e.g. most can generate MemoryErrors). Possible other errors are ValueErrors, NameErrors, ImportErrors, etc. etc.." Explanations like this make me long for a construct like Java's throws keyword. :-/
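
If you fetch the bytes yourself and decode them before handing the text to parse(), you can guard against both an unknown encoding name and undecodable bytes. The Python 3 sketch below shows one possible caller-side policy (fall back to ISO-8859-1); it is not what this module does, and decode_with_fallback is a hypothetical name.

```python
def decode_with_fallback(raw_bytes, declared_encoding, fallback="iso-8859-1"):
    """Decode with the declared encoding, falling back when it is unusable."""
    if declared_encoding:
        try:
            return raw_bytes.decode(declared_encoding)
        except (LookupError, ValueError):
            # LookupError: Python doesn't know the encoding name.
            # ValueError: includes UnicodeError, i.e. the bytes don't fit it.
            pass
    return raw_bytes.decode(fallback)
```

Since every byte value is defined in ISO-8859-1, the fallback decode always yields some text, although not necessarily the text the author intended.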

You also need to watch out for a variety of exceptions from fetch() because it calls urllib2 which calls httplib which calls socket. I repackage some of the exceptions but others get raised up to the caller untouched. Experience has taught me to watch out for httplib.BadStatusLine, socket.error and socket.timeout. You also need to handle Python bug 900744 which causes httplib to raise ValueError in some cases. This affects Python 2.4 but not 2.3. AFAIK there is no workaround other than to apply the patch supplied.

Last but not least, there's a bug in urllib2 that raises OSError on rare occasions. This has been fixed but as of Python 2.5.1 the patch has not yet been integrated.
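
Pulling the advice in this section together, a caller-side wrapper around fetch() might look roughly like the sketch below. It uses Python 3 names (urllib.error and http.client in place of urllib2 and httplib). The function name fetch_rules, the choice of 410 as a non-fatal code and the assumption that the raised URLError carries the response code in a code attribute are all mine, not guarantees made by this module.

```python
import http.client
import socket
import urllib.error


def fetch_rules(parser, url, nonfatal_codes=(410,)):
    """Try to load robots.txt into parser.

    Returns True if the parser is usable afterwards and False if the caller
    should defer visiting the site and try again later.
    """
    try:
        parser.fetch(url)
        return True
    except urllib.error.URLError as err:
        # Assumption: the response code, if any, is available as err.code.
        if getattr(err, "code", None) in nonfatal_codes:
            parser.parse("")  # treat robots.txt as empty and carry on
            return True
        return False
    except (http.client.BadStatusLine, socket.timeout, OSError, ValueError):
        # ValueError also covers UnicodeError raised while decoding.
        return False
```

Whether a failed fetch should mean "try again later" or something stricter is exactly the policy decision that this module leaves to the caller.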

A Simple Example

For more examples, see parser_test.py.

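    import robotexclusionrulesparser

    rerp = robotexclusionrulesparser.RobotExclusionRulesParser()

    # I'll set the (optional) UserAgent before calling fetch.
    rerp.user_agent = "Foobot/2.1 (See http://example.net/foobot.html for info)"

    # Note that there should be a try/except here to handle urllib2.URLError,
    # socket.timeout, UnicodeError, etc.
    rerp.fetch("http://www.example.org/robots.txt")

    # Placeholder (user agent, URL) pairs to check; substitute your own.
    ListOfUserAgentAndUrlPairs = [
        ("Foobot/2.1", "http://www.example.org/index.html"),
        ("Mozilla/5.0 (compatible; Foobot/2.1)", "http://www.example.org/private/"),
    ]

    # Report whether each user agent may fetch the corresponding URL.
    for UserAgent, url in ListOfUserAgentAndUrlPairs:
        print("Can %s fetch %s? %s" %
              (UserAgent, url, rerp.is_allowed(UserAgent, url)))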

Version History

License

This code is copyright Philip Semanchuk under the GNU General Public License.

Contact

Comments, bug reports, etc. are most welcome.