Nikita the Spider

A Robot Exclusion Rules Parser for Python


Robotexclusionrulesparser is a BSD-licensed alternative to and improvement on the Python standard library module robotparser. It provides the classes RobotExclusionRulesParser and RobotFileParserLookalike. The latter wraps the former and is a drop-in replacement for the parser in Python's standard library.

Robotexclusionrulesparser runs under Python 2.4 – 3.3. It hasn't been tested with Python < 2.4 or > 3.3, but it might work with those versions.

You can download the full package (containing documentation, unit tests, etc.) or just the individual module.

Full package: robotexclusionrulesparser-1.6.2.tar.gz [md5 sum]

Just the module: robotexclusionrulesparser-1.6.2.py

The exact same robotexclusionrulesparser tarball is also available on PyPI.

This documentation refers to three similar-but-different robots.txt standards called MK1994, MK1996 and GYM2008. The first two comprise the traditional robots.txt standard and the last describes some extensions. There are details about these robots.txt standards below.

Differences Between robotexclusionrulesparser and Python's robotparser

This module offers a class (RobotFileParserLookalike) that is a functional drop-in replacement for the standard library's robotparser.RobotFileParser. You can also use the slightly nicer but different interface exposed by the class RobotExclusionRulesParser. Both classes differ from the Python standard library module as described below.

  1. This module understands the GYM2008 syntax. GYM2008 is shorthand for the robots.txt syntax extensions (path wildcards, the Crawl-delay directive and the Sitemap directive) agreed upon by Google, Yahoo and Microsoft in 2008.
  2. This module accepts non-ASCII characters in robots.txt. It decodes the file with the encoding specified in the HTTP Content-Type header sent with the robots.txt file. If no encoding is specified, it defaults to ISO-8859-1 per the HTTP specification.
  3. This module implements the "Expiration" section of MK1996. Specifically, it looks for an HTTP Expires header when fetching robots.txt. If it finds one, it stores that expiration date. Otherwise it uses the MK1996 default of one week. The boolean property is_expired makes use of this date; see the usage notes for more information. Consequently, this module dispenses with the modified() and mtime() functions provided by robotparser. However, I deliberately left the instance variable expiration_date easily accessible in case you want to mess with it.
  4. This module handles HTTP fetching errors differently from robotparser. MK1996 is mostly non-committal on this topic, so handling of these codes is somewhat implementation-dependent. IMHO, robotparser complies with the letter but not the spirit of MK1996 with regard to handling error codes. MK1996 says, "On the request attempt resulted in temporary failure [sic] a robot should defer visits to the site until such time as the resource can be retrieved". It also says that this behavior "is not required...[but is] recommended". robotparser handles those errors internally (e.g. a 503 Service Unavailable is interpreted as "allow all"); this module punts such errors up to the caller so that she can decide how to handle them.
  5. There's a bug in robotparser's handling of robots.txt files that contain a BOM (byte order mark). It doesn't make any accommodation for them, so it might see the first line of a robots.txt file with a UTF-8 BOM as this:
    [BOM]User-agent: foobot

    The bug can have significant consequences when robots.txt consists of this:
    [BOM]User-agent: *
    Disallow: /

    The user-agent line will be seen as garbage, so the disallow rule will be ignored. The result is that all robots will be permitted everywhere, which is the exact opposite of what the robots.txt author intended.

    This module doesn't get confused by BOMs; it simply ignores them.

  6. This module adds a user_agent attribute that, if populated, is sent in lieu of Python's user agent when fetching robots.txt.
  7. This module's parse() function accepts a string; that string can be Unicode. If it isn't Unicode, it's converted to Unicode using ISO-8859-1.
  8. This module adds a response_code attribute that reports (what else?) the response code when a robots.txt file is fetched from a remote server.
  9. This module accepts "user-agent" or "useragent" as being valid in robots.txt. The spec permits only the former.
  10. There's a bug in robotparser's handling of paths that contain a %-encoded forward slash; MK1996 says that they shouldn't be translated but the robotparser module does. This module replaces that bug with newer, more interesting bugs. :-)

Usage - General

The module has two classes: RobotExclusionRulesParser and RobotFileParserLookalike. The latter offers all the features of the former and also bolts on an API that makes it a drop-in replacement for the standard library's robotparser.RobotFileParser.

The module defines the constants MK1996 and GYM2008 which refer to the different syntaxes that this module understands. MK1996 is the traditional syntax; GYM2008 respects wildcards in paths.

Usage - Class RobotExclusionRulesParser

A RobotExclusionRulesParser instance has four functions and seven attributes. The most common usage is to call fetch() to set up the parser and then call is_allowed(). If your code is long-running, you'll also want to check is_expired occasionally. Everything else is non-essential. The constructor takes no parameters.

Functions

fetch(url, timeout=None)
Fetch robots.txt from the URL provided and parse it. This method sets expiration_date, source_url, and response_code.

The timeout (supported under Python 2.6 and later) is a float measured in seconds. If a timeout occurs, urllib2.URLError is raised under Python 2 and socket.timeout under Python 3.

parse(content)
Parse a string representing the content of a robots.txt file. This is useful if your robots.txt file isn't HTTP-accessible, or if you just want to experiment. The unit tests make heavy use of this function.
is_allowed(user_agent, url, syntax=GYM2008)
Return a boolean indicating whether or not the given user agent is allowed to visit the URL. The user agents listed in robots.txt need only be present as a substring of the user_agent parameter for this function to match them; the comparison is case-insensitive.

For instance, passing a user_agent of Mozilla/5.0 (compatible; Foobot/2.1) would match the user agent rule foobot.

The scheme and authority are discarded from the URL when comparing it to robots.txt rules. (e.g. http://www.example.com/foo/bar.html becomes /foo/bar.html.) This is the way you want it to work -- the rules in robots.txt don't specify scheme and authority themselves, so one can't match against them.

The syntax parameter must be one of GYM2008 (the default) or MK1996. The former indicates that the module should respect wildcards while the latter indicates that * and $ should be treated as literals.

get_crawl_delay(user_agent)
Returns the crawl delay for this user agent as a float, or None if no crawl delay is defined.
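
For example, here is a minimal sketch (the robots.txt content and bot names are made up) that exercises parse(), is_allowed() and get_crawl_delay() without touching the network:

    import robotexclusionrulesparser

    rerp = robotexclusionrulesparser.RobotExclusionRulesParser()

    # Parse a string directly instead of fetching robots.txt over HTTP.
    rerp.parse("User-agent: foobot\n"
               "Crawl-delay: 5\n"
               "Disallow: /private/\n")

    print rerp.is_allowed("Foobot/2.1", "/private/index.html")   # False
    print rerp.is_allowed("Foobot/2.1", "/index.html")           # True
    print rerp.get_crawl_delay("Foobot/2.1")                     # 5.0
    print rerp.get_crawl_delay("Barbot/1.0")                     # None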

Attributes and Properties

user_agent
A read/write string. The parser will send this user agent string to the server when fetching robots.txt. If blank or None (the default), the parser uses Python's default user agent string.
source_url
This read-only property reports the URL that you used in the most recent call to fetch(). This is useful when the parser's expiration date passes because you can simply call parser.fetch(parser.source_url) to refresh the parser, as shown in the example after these attributes.
use_local_time
A read/write boolean that tells RobotExclusionRulesParser whether expiration_date should be in local time or UTC (a.k.a. Greenwich Mean Time). Since expiration_date is set when you call fetch(), you must set use_local_time before calling fetch() for it to have any effect.

If you only check is_expired and never look at expiration_date, you can leave use_local_time at its default (True).

is_expired
A read-only property that contains a boolean indicating whether or not the parser has passed its expiration date (the dreaded "not-so-fresh" feeling). The parser sets the expiration date when you call fetch() either by reading the HTTP Expires header or by using a default of seven days as specified in MK1996 § 3.4. See also the related attributes expiration_date and use_local_time.
expiration_date
A read/write timestamp that states when the robots.txt contents are out of date. The timestamp is a Unix-style timestamp; i.e. a float counting the number of seconds since the epoch. The property is_expired will compare this to "now" for you.
response_code
A read-only property that contains the integer response code received during the last fetch from a remote server, or None if fetch has not been called. When using Python ≤ 2.3, this information is less precise. It is set to 200 if the fetch is successful or None otherwise. (Older versions of Python's urllib don't provide this information and so I have to fake it.)
sitemap
Deprecated. Use sitemaps instead.
sitemaps
A read-only property that returns a list of the sitemap URLs defined in the robots.txt. This module simply reports the data it found after the Sitemap directives. No guarantees are made about whether or not that data contains valid and/or working URLs.
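
A long-running program can combine is_expired and source_url into a simple refresh cycle. Here is a minimal sketch of that pattern (the URL is made up and error handling is omitted):

    import robotexclusionrulesparser

    rerp = robotexclusionrulesparser.RobotExclusionRulesParser()
    rerp.fetch("http://www.example.org/robots.txt")

    # ...much later, before checking permissions again...
    if rerp.is_expired:
        # The rules are stale; re-fetch them from the same URL.
        rerp.fetch(rerp.source_url)

    print rerp.response_code   # e.g. 200
    print rerp.sitemaps        # a (possibly empty) list of sitemap URLs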

Usage - Class RobotFileParserLookalike

RobotFileParserLookalike is a drop-in replacement for the standard library module, so I refer you to the documentation for robotparser.RobotFileParser for API details. Only differences from the standard library module are mentioned below.

Functional Differences

can_fetch(useragent, url, syntax=GYM2008)
The syntax argument is not present in the standard library version.
mtime()
The robotparser documentation says that this function "Returns the time the robots.txt file was last fetched". This isn't true, though. It actually returns the last time modified() was called. Furthermore, it's up to the caller to call modified(); it's not called automatically when one calls RobotFileParser.read().

In other words, unless one calls modified(), mtime() will always return 0.

This module's mtime() method mimics the behavior of the standard library module, not its documentation.

Attribute Differences

RobotFileParserLookalike exposes last_checked but none of entries, default_entry, disallow_all or allow_all.
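
For instance, code written against the standard library should only need its constructor swapped. Here is a minimal sketch under that assumption (the URL and bot name are invented, and error handling is omitted; see the Exceptions section below):

    import robotexclusionrulesparser

    # Use this wherever robotparser.RobotFileParser() was used before.
    parser = robotexclusionrulesparser.RobotFileParserLookalike()
    parser.set_url("http://www.example.org/robots.txt")
    parser.read()

    # The same call as the standard library, plus the optional syntax argument.
    print parser.can_fetch("Foobot", "/cgi-bin/")
    print parser.can_fetch("Foobot", "/cgi-bin/",
                           syntax=robotexclusionrulesparser.MK1996)

    # As noted above, mtime() returns 0 unless modified() has been called.
    print parser.mtime()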

Exceptions

Users of this module should be aware that it raises a few exceptions. Some of them are impossible to finesse internally (Unicode errors, for instance). Others are deliberately exposed because handling them is outside of the scope of the robots.txt specifications and thus outside of the scope of this module. (If, for instance, a robots.txt is present but an error occurs during its transmission.)

The first exception explicitly raised by this code is a Unicode exception (some flavor of UnicodeError). You can see that in two different situations. First, if the parser fetches a robots.txt file that can't be decoded using the encoding specified in the HTTP response header. (That encoding defaults to ISO-8859-1 which is a superset of US-ASCII which is what > 99.9% of existing robots.txt files use.) Second, you'll see a Unicode exception if you feed a non-Unicode string (i.e. isinstance(YourString, unicode) == False) to parse() and that string can't be decoded using ISO-8859-1.

The second exception explicitly raised by this code is a urllib2.URLError exception. fetch() uses urllib2.urlopen() and if that function raises an exception, the exception is passed up to the caller after being massaged to make it a little nicer to deal with.

Note that not all non-200 response codes raise an exception; those for which MK1996 defines specific actions are handled internally.

If the RobotExclusionRulesParser raises a URLError exception that the caller decides isn't fatal (e.g. the response code 410 Gone), she can just call parser.parse("") and use the parser as normal.

Note that although urllib2 handles most redirects by itself, it can return 301 or 302 as the response code if the server generates an infinite loop of redirects. Users of this module should be prepared to handle that response code.
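
Here is a sketch of that kind of defensive fetch (the URL and bot name are invented; a real crawler would decide for itself which failures are fatal):

    import socket
    import urllib2

    import robotexclusionrulesparser

    rerp = robotexclusionrulesparser.RobotExclusionRulesParser()

    try:
        rerp.fetch("http://www.example.org/robots.txt")
    except (urllib2.URLError, socket.error, socket.timeout):
        # This caller treats the failure as non-fatal, so she falls back to
        # an empty ruleset and uses the parser as normal.
        rerp.parse("")

    print rerp.is_allowed("Foobot", "/")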

These aren't the only exceptions that you might see; they're just the ones that the code raises explicitly. Another likely source of exceptions is the unicode() function. The function fetch() gets the encoding from the Content-Type header that comes with the robots.txt file. That encoding gets passed directly to unicode(). When I first started using unicode() I naïvely expected that an encoding Python didn't understand would be bounced back as a LookupError. Apparently I wasn't alone in thinking that; Python bug 960874 was filed for that reason. But it was closed as no bug/no fix with the explanation that "it is not guaranteed that you will only see LookupErrors (the same is true for most other Python APIs, e.g. most can generate MemoryErrors). Possible other errors are ValueErrors, NameErrors, ImportErrors, etc. etc." Explanations like this make me long for a construct like Java's throws keyword. :-/

You also need to watch out for a variety of exceptions from fetch() because it calls urllib2 which calls httplib which calls socket. I repackage some of the exceptions but others get raised up to the caller untouched. Experience has taught me to watch out for httplib.BadStatusLine, socket.error and socket.timeout. You also need to handle Python bug 900744 which causes httplib to raise ValueError in some cases. This affects Python 2.4 but not 2.3. AFAIK there is no workaround other than to apply the patch supplied.

Last but not least, there's a bug in urllib2 that raises OSError on rare occasions. This has been fixed but as of Python 2.5.1 the patch has not yet been integrated.

A Simple Example

For extensive examples, see parser_test.py.

    import robotexclusionrulesparser

    rerp = robotexclusionrulesparser.RobotExclusionRulesParser()

    # I'll set the (optional) user_agent before calling fetch.
    rerp.user_agent = "Foobot/2.1 (See http://example.net/foobot.html for info)"

    # Note that there should be a try/except here to handle urllib2.URLError,
    # socket.timeout, UnicodeError, etc.
    rerp.fetch("http://www.example.org/robots.txt")
    
    user_agents_and_urls = [ ("Foobot", "/index.html"), ("Barbot", "/") ]

    for user_agent, url in user_agents_and_urls:
        print "Can %s fetch '%s'? %s" % \
            (user_agent, url, rerp.is_allowed(user_agent, url))

Compliance with Published Specifications

"Does it comply with the spec?" is a trick question in this case; there is no real spec. The most recent formal robots.txt format proposal was published in 1996 (urk!) and that was only a draft which was never sanctified. (It says clearly at the top, "It is inappropriate to use Internet-Drafts as reference material...") Even specs that have gone through a full review and comment process can be open to interpretation, so it's no surprise that the robots.txt draft spec has some holes. Actually, it is surprisingly complete, considering.

In addition, Google, Yahoo and Microsoft announced in 2008 that they would jointly support extensions to the robots.txt syntax. Although a set of blog postings feels even less official than an unblessed draft RFC, this syntax is quickly becoming the de facto standard.

I refer to Martijn Koster's relatively well-known 1994 document as MK1994, his lesser-known but more formal 1996 draft spec as MK1996, and the Google-Yahoo-Microsoft syntax as GYM2008.

This module implements all of MK1994, MK1996 and GYM2008, including the lesser-known parts of MK1994/96.

The vast majority of robots.txt tutorials and the like make no mention of the features introduced in MK1996 (like Allow: fields) or wrongly attribute them to GYM2008. Furthermore, many insist that end-of-line markers must be Unix-style \n even though it is clearly stated in MK1994 and MK1996 that \r, \n and \r\n are all acceptable. Even such luminaries as Wikipedia, Microsoft and the W3C seem unaware of MK1996, although they are willing to quote the older and less formal MK1994. Hrrmph.

GYM2008

GYM2008 consists of three small extensions to MK1994/96. Google describes two of them here but you'll have to visit Yahoo for an explanation of Crawl-delay.

Two of the three GYM2008 extensions are harmless. The Crawl-delay and Sitemap directives are ignored by older parsers and are a useful addition to the standard.

The GYM2008 allowance for path wildcards is less benign because it breaks parsers that obey MK1994/96. For instance, consider the following robots.txt:

   User-agent: *
   Disallow: *

The User-agent line is valid in both MK1994/96 and GYM2008; it means simply "all user agents". But the Disallow path wildcard is specific to GYM2008 syntax. In the traditional MK1994/96 syntax, all paths are treated literally, so this robots.txt says that only the file with the unlikely name '*' is disallowed. Under GYM2008 syntax rules, all files are disallowed.
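
Here is a minimal sketch (the bot name is invented) of how this module reads that file under each syntax:

    import robotexclusionrulesparser

    rerp = robotexclusionrulesparser.RobotExclusionRulesParser()
    rerp.parse("User-agent: *\n"
               "Disallow: *\n")

    # GYM2008 (the default): * is a wildcard, so everything is disallowed.
    print rerp.is_allowed("Foobot", "/index.html",
                          syntax=robotexclusionrulesparser.GYM2008)   # False

    # MK1996: * is a literal, so only a path named '*' is disallowed.
    print rerp.is_allowed("Foobot", "/index.html",
                          syntax=robotexclusionrulesparser.MK1996)    # True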

This problem is exacerbated by the fact that many new Webmasters adopt the GYM2008 syntax without realizing that it is relatively new and in conflict with the traditional syntax. As a result, if a bot that's been behaving perfectly well for 10+ years encounters a robots.txt like the one above, it may assume (correctly!) that it is permitted access to all files on the site although the Webmaster assumes just the opposite. The Webmaster will assume that the bot is ill-behaved and may go so far as to ban it.

This module defaults to GYM2008 syntax. This is unlikely to cause problems because the characters that GYM2008 reserves for special treatment (* and $) are unlikely to occur as path literals. In other words, a GYM2008-aware parser like this one is extremely unlikely to misinterpret a robots.txt written to MK1994/96 standards. Note that the reverse is not true – it's very likely that a parser unaware of GYM2008 will misinterpret the intent of a robots.txt that uses GYM2008-specific syntax.

My Extensions to Published Specifications

In the spirit of "be generous in what you accept", this module also handles some things that are invalid according to the specs.

First, RobotExclusionRulesParser accepts "user-agent" or "useragent" in robots.txt whereas MK1994/96 only permit the former.

The second and most significant exception this module permits is the presence of non-ASCII characters in the significant fields of robots.txt. ("Significant" here means "anything outside of a comment".) MK1994 doesn't address the subject of non-ASCII or encodings, but MK1996 (in Section 3.3, "Formal Syntax") makes it clear that only ASCII characters are allowed in significant fields. (Side note – actually only a subset of printable ASCII characters is allowed, but you'll have to read the spec yourself to get the gory details.)

Despite what MK1996 says, a survey of real-world robots.txt files shows that about one in every thousand includes non-ASCII in significant fields. Python's robotparser module rolls over and dies when it encounters these. In contrast, this module attempts to decode the file using (a) the encoding specified in the HTTP Content-Type header sent with robots.txt (if present, which is rare) or (b) a default of ISO-8859-1 as per the HTTP spec RFC 2616. This solves the encoding problems for nearly all non-ASCII robots.txt files.

Non-compliance with Published Specifications

MK1996 contradicts MK1994 somewhat. MK1994 (which only defines disallows) says that a blank path indicates nothing is disallowed. MK1996 (which defines both allows and disallows) doesn't permit blank paths (the minimal path is a single slash: /) but doesn't mention anything about this change in the section on backwards compatibility. AFAIK blank paths are still widely used in Disallow lines, which is consistent with the fact that most of the Net seems to ignore MK1996 and regard MK1994 as the de facto standard.

In the absence of other guidance, this code interprets blank disallow lines as meaning nothing is disallowed and for consistency interprets blank allow lines as meaning nothing is allowed. Thus, these two rules mean the same thing:

        # Disallow everything
        User-agent: foobot
        Disallow: /

        # Allow nothing
        User-agent: foobot
        Allow:
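
Here is a sketch of both interpretations side by side (using the foobot examples above):

    import robotexclusionrulesparser

    # A blank Disallow: nothing is disallowed, so foobot may fetch anything.
    rerp = robotexclusionrulesparser.RobotExclusionRulesParser()
    rerp.parse("User-agent: foobot\n"
               "Disallow:\n")
    print rerp.is_allowed("foobot", "/index.html")   # True

    # A blank Allow: nothing is allowed, so foobot may fetch nothing.
    rerp = robotexclusionrulesparser.RobotExclusionRulesParser()
    rerp.parse("User-agent: foobot\n"
               "Allow:\n")
    print rerp.is_allowed("foobot", "/index.html")   # False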

Version History

License

This code is copyright Philip Semanchuk under a 3-clause BSD license.

Thanks to Bastian Kleineidam for writing Python's robotparser module. Parts of this module were inspired (directly and indirectly) by his work.

Contact

Comments, bug reports, etc. are most welcome.