Robotexclusionrulesparser is a BSD-licensed alternative to and improvement on the Python standard library module robotparser. You can download the full package (containing documentation, unit tests, etc.) or just the individual module.
Robotexclusionrulesparser runs under Python 2.4 – 3.2. It hasn't been tested with Python < 2.4 or > 3.2, but it might work with those versions.
Full package: robotexclusionrulesparser-1.6.1.tar.gz [md5 sum]
Just the module: robotexclusionrulesparser-1.6.1.py
This documentation refers to three similar-but-different robots.txt standards called MK1994, MK1996 and GYM2008. In short, the first two comprise the traditional robots.txt standard and the last describes some extensions. There are details about these robots.txt standards below.
This module offers a class (RobotFileParserLookalike) that is a functional drop-in replacement for the standard library's robotparser.RobotFileParser. You can also use the slightly nicer but different interface exposed by the class RobotExclusionRulesParser. Both classes differ from the Python standard library module as described below.
The bug can have significant consequences when robots.txt consists of this:
The user-agent line will be seen as garbage and so the disallow rule will be ignored. The result will be that all robots will be permitted everywhere which is the exact opposite of what the robots.txt author intended.
This module doesn't get confused by BOMs; it simply ignores them.
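To illustrate the idea only (this is not the module's actual code), a leading BOM can be dropped from the decoded text before any line is parsed. The helper name strip_bom is hypothetical, and the sketch assumes Python 3.

```python
import codecs


def strip_bom(text):
    """Return text without a leading Unicode byte order mark, if present."""
    # U+FEFF is what a UTF-8 BOM turns into once the bytes are decoded.
    if text.startswith("\ufeff"):
        return text[1:]
    return text


# A UTF-8 encoded robots.txt that starts with a BOM.
raw = codecs.BOM_UTF8 + b"User-agent: *\nDisallow: /\n"
cleaned = strip_bom(raw.decode("utf-8"))
first_line = cleaned.splitlines()[0]
```

With the BOM gone, the first line reads as an ordinary User-agent line instead of garbage.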
The module has two classes: RobotExclusionRulesParser and RobotFileParserLookalike. The latter offers all the features of the former and also bolts on an API that makes it a drop-in replacement for the standard library's robotparser.RobotFileParser.
The module defines the constants MK1996 and GYM2008 which refer to the different syntaxes that this module understands. MK1996 is the traditional syntax; GYM2008 respects wildcards in paths.
A RobotExclusionRulesParser instance has five functions and five attributes. The most common usage is to call fetch() to set up the parser and then call is_allowed(). If your code is long-running, you'll also want to call is_expired() occasionally. Everything else is non-essential. The constructor takes no parameters.
The timeout (supported under Python > 2.5) is a float measured in seconds. If a timeout occurs, urllib2.URLError is raised under Python 2 and socket.timeout under Python 3.
For instance, passing a user_agent of Mozilla/5.0 (compatible; Foobot/2.1) would match the user agent rule foobot.
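The matching in that example behaves roughly like a case-insensitive substring test. The function below is a hypothetical sketch of that idea for Python 3, not the module's own implementation; the name rule_matches_agent is made up for illustration.

```python
def rule_matches_agent(rule_name, user_agent):
    """Does a robots.txt User-agent value apply to this user agent string?"""
    # "*" is the catch-all rule that applies to every robot.
    if rule_name == "*":
        return True
    # Otherwise compare case-insensitively, looking for the rule name
    # anywhere inside the full user agent string.
    return rule_name.lower() in user_agent.lower()


full_agent = "Mozilla/5.0 (compatible; Foobot/2.1)"
matched = rule_matches_agent("foobot", full_agent)
```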
The scheme and authority are discarded from the URL when comparing it to robots.txt rules. (e.g. http://www.example.com/foo/bar.html becomes /foo/bar.html.) This is the way you want it to work -- the rules in robots.txt don't specify scheme and authority themselves, so one can't match against them.
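A rough sketch of that reduction, using Python 3's urllib.parse, might look like the following. The helper name path_for_matching is hypothetical. Keeping the query string and treating an empty path as / are assumptions of this sketch rather than documented behavior of the module.

```python
from urllib.parse import urlsplit


def path_for_matching(url):
    """Drop scheme and authority so the URL can be compared to robots.txt rules."""
    parts = urlsplit(url)
    # Assumption: a URL with no path refers to the site root.
    path = parts.path or "/"
    # Assumption: the query string stays attached to the path.
    if parts.query:
        path += "?" + parts.query
    return path


reduced = path_for_matching("http://www.example.com/foo/bar.html")
```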
The syntax parameter must be one of GYM2008 (the default) or MK1996. The former indicates that the module should respect wildcards while the latter indicates that * and $ should be treated as literals.
Return a boolean indicating whether or not the parser has passed its expiration date (the dreaded "not-so-fresh" feeling). The expiration date is set when you call fetch() either by reading the HTTP Expires header or by using a default of seven days. See also the related attributes expiration_date and use_local_time.
If you only call is_expired() and never look at expiration_date, you can leave use_local_time at its default (True).
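The expiration logic can be pictured along these lines. This is a minimal Python 3 sketch under stated assumptions, not the module's code: the function names are hypothetical, headers is assumed to be a plain dict, and everything is kept in UTC rather than honoring use_local_time.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

DEFAULT_LIFETIME = timedelta(days=7)


def expiration_from_headers(headers, now):
    """Use the HTTP Expires header if it parses, else now plus seven days."""
    expires = headers.get("Expires")
    if expires:
        try:
            parsed = parsedate_to_datetime(expires)
        except (TypeError, ValueError):
            # Unparseable header: fall through to the default lifetime.
            parsed = None
        if parsed is not None:
            if parsed.tzinfo is None:
                # Assumption: treat a zone-less date as UTC.
                parsed = parsed.replace(tzinfo=timezone.utc)
            return parsed
    return now + DEFAULT_LIFETIME


def has_expired(expiration, now):
    """True once the current time has passed the expiration date."""
    return now > expiration


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
from_default = expiration_from_headers({}, now)
from_header = expiration_from_headers(
    {"Expires": "Thu, 01 Dec 1994 16:00:00 GMT"}, now)
```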
This attribute is read-only.
This attribute is read-only and defaults to an empty list.
RobotFileParserLookalike is a drop-in replacement for the standard library module, so I refer you to the documentation for robotparser.RobotFileParser for API details. Only differences from the standard library module are mentioned below.
In other words, unless one calls modified(), mtime() will always return 0.
This module's mtime() method mimics the behavior of the standard library module, not its documentation.
RobotFileParserLookalike exposes last_checked but none of entries, default_entry, disallow_all or allow_all.
Users of this module should be aware that it raises a few exceptions. Some of them are impossible to finesse internally (Unicode errors, for instance). Others are deliberately exposed because handling them is outside of the scope of the robots.txt specifications and thus outside of the scope of this module. (If, for instance, a robots.txt is present but an error occurs during its transmission.)
The first exception explicitly raised by this code is a Unicode exception (some flavor of UnicodeError). You can see that in two different situations. First, if the parser fetches a robots.txt file that can't be decoded using the encoding specified in the HTTP response header. (That encoding defaults to ISO-8859-1 which is a superset of US-ASCII which is what > 99.9% of existing robots.txt files use.) Second, you'll see a Unicode exception if you feed a non-Unicode string (i.e. isinstance(YourString, unicode) == False) to parse() and that string can't be decoded using ISO-8859-1.
The second exception explicitly raised by this code is a urllib2.URLError exception. fetch() uses urllib2.urlopen() and if that function raises an exception, the exception is passed up to the caller after being massaged to make it a little nicer to deal with.
Note that not all non-200 response codes raise an exception. Those for which MK1996 defines specific actions are handled internally –
If the RobotExclusionRulesParser raises a URLError exception that the caller decides isn't fatal (e.g. the response code 410 Gone), she can just call parser.parse("") and use the parser as normal.
Note that although urllib2 handles most redirects by itself, urllib2 can return 301/302 as the response code if the server generates an infinite loop of 301/302 redirects. Users of this module should be prepared to handle that response code.
These aren't the only exceptions that you might see; they're just the ones that the code raises explicitly. Another likely source of exceptions is the unicode() function. The function fetch() gets the encoding from the Content-Type header that comes with the robots.txt file. That encoding gets passed directly to unicode(). When I first started using unicode() I naïvely expected that an encoding Python didn't understand would be bounced back as a LookupError. Apparently I wasn't alone in thinking so; bug 960874 was filed for that reason. But it was closed as no bug/no fix with the explanation that it is not guaranteed that you will only see LookupErrors (the same is true for most other Python APIs, e.g. most can generate MemoryErrors). Possible other errors are ValueErrors, NameErrors, ImportErrors, etc. Explanations like this make me long for a construct like Java's throws keyword. :-/
You also need to watch out for a variety of exceptions from fetch() because it calls urllib2 which calls httplib which calls socket. I repackage some of the exceptions but others get raised up to the caller untouched. Experience has taught me to watch out for httplib.BadStatusLine, socket.error and socket.timeout. You also need to handle Python bug 900744 which causes httplib to raise ValueError in some cases. This affects Python 2.4 but not 2.3. AFAIK there is no workaround other than to apply the patch supplied.
Last but not least, there's a bug in urllib2 that raises OSError on rare occasions. This has been fixed but as of Python 2.5.1 the patch has not yet been integrated.
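One way to contain the exceptions listed above is to wrap the call to fetch() in a single try/except. The sketch below assumes Python 3, where urllib2.URLError lives in urllib.error and httplib is http.client. The helper name try_fetch and its return convention are hypothetical, and the list of exceptions is only the set named in this document, not a guarantee of completeness.

```python
import http.client
import socket
import urllib.error

# Exceptions this document says can escape from fetch().
FETCH_ERRORS = (
    urllib.error.URLError,
    http.client.BadStatusLine,
    socket.timeout,
    socket.error,   # an alias of OSError in Python 3
    UnicodeError,
    LookupError,    # e.g. an encoding name Python does not know
    ValueError,
)


def try_fetch(fetch, url):
    """Call fetch(url); return (True, None) or (False, the exception)."""
    try:
        fetch(url)
    except FETCH_ERRORS as error:
        return False, error
    return True, None


# Stand-in for a real parser's fetch method, used to show the behavior.
def slow_fetch(url):
    raise socket.timeout("timed out")


ok, error = try_fetch(slow_fetch, "http://www.example.org/robots.txt")
```

With a real parser, the bound method parser.fetch would be passed as the first argument. The caller can then inspect the returned exception and decide whether it is fatal.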
For extensive examples, see parser_test.py.
    import robotexclusionrulesparser

    rerp = robotexclusionrulesparser.RobotExclusionRulesParser()

    # I'll set the (optional) user_agent before calling fetch.
    rerp.user_agent = "Foobot/2.1 (See http://example.net/foobot.html for info)"

    # Note that there should be a try/except here to handle urllib2.URLError,
    # socket.timeout, UnicodeError, etc.
    rerp.fetch("http://www.example.org/robots.txt")

    user_agents_and_urls = [ ("Foobot", "/index.html"), ("Barbot", "/") ]

    for user_agent, url in user_agents_and_urls:
        print "Can %s fetch '%s'? %s" % \
            (user_agent, url, rerp.is_allowed(user_agent, url))
"Does it comply with the spec?" is a trick question in this case; there is no real spec. The most recent formal robots.txt format proposal was published in 1996 (urk!) and that was only a draft which was never sanctified. (It says clearly at the top, "It is inappropriate to use Internet-Drafts as reference material...") Even specs that have gone through a full review and comment process can be open to interpretation, so it's no surprise that the robots.txt draft spec has some holes. Actually, it is surprisingly complete, considering.
In addition, Google, Yahoo and Microsoft announced in 2008 that they would jointly support extensions to the robots.txt syntax. Although a set of blog postings feels even less official than an unblessed draft RFC, this syntax is quickly becoming the de facto standard.
I refer to Martijn Koster's relatively well-known 1994 document as MK1994, his lesser-known but more formal 1996 draft spec as MK1996, and the Google-Yahoo-Microsoft syntax as GYM2008.
This module implements all of MK1994, MK1996 and GYM2008. In particular, it supports the following lesser-known parts of MK1994/96:
The vast majority of robots.txt tutorials and the like make no mention of the features introduced in MK1996 (like Allow: fields) or wrongly attribute them to GYM2008. Furthermore, many insist that end-of-line markers must be Unix-style \n even though it is clearly stated in MK1994 and MK1996 that \r, \n and \r\n are all acceptable. Even such luminaries as Wikipedia, Microsoft and the W3C seem unaware of MK1996, although they are willing to quote the older and less formal MK1994. Hrrmph.
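Accepting all three end-of-line styles takes very little code. The following Python 3 sketch (with a hypothetical helper name, split_lines) shows one way to do it. It uses an explicit regular expression rather than str.splitlines() because the latter also breaks on several other characters.

```python
import re


def split_lines(text):
    """Split robots.txt text on \\r\\n, \\r or \\n, and nothing else."""
    # \r\n is listed first so that it counts as one line break, not two.
    return re.split(r"\r\n|\r|\n", text)


lines = split_lines("User-agent: *\r\nDisallow: /a\rDisallow: /b\nAllow: /c")
```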
GYM2008 consists of three small extensions to MK1994/96. Google describes two of them here but you'll have to visit Yahoo for an explanation of Crawl-delay.
Two of the three GYM2008 extensions are harmless. The Crawl-delay and Sitemap directives are ignored by older parsers and are a useful addition to the standard.
The GYM2008 allowance for path wildcards is less benign because it breaks parsers that obey MK1994/96. For instance, consider the following robots.txt:
    User-agent: *
    Disallow: *
The User-agent line is valid in both MK1994/96 and GYM2008; it means simply "all user agents". But the Disallow path wildcard is specific to GYM2008 syntax. In the traditional MK1994/96 syntax, all paths are treated literally, so this robots.txt says that only the file with the unlikely name '*' is disallowed. Under GYM2008 syntax rules, all files are disallowed.
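The difference between the two readings can be sketched in a few lines of Python 3. These functions are illustrative assumptions about how each syntax treats a rule path, not the module's own matching code, and their names are hypothetical.

```python
import re


def matches_mk1996(rule_path, url_path):
    """Traditional syntax: the rule is a literal prefix of the path."""
    return url_path.startswith(rule_path)


def matches_gym2008(rule_path, url_path):
    """Wildcard syntax: * matches any run of characters and a
    trailing $ anchors the rule to the end of the path."""
    anchored = rule_path.endswith("$")
    if anchored:
        rule_path = rule_path[:-1]
    # Escape the literal pieces and join them with ".*" for each wildcard.
    pattern = ".*".join(re.escape(piece) for piece in rule_path.split("*"))
    if anchored:
        pattern += r"\Z"
    return re.match(pattern, url_path) is not None


literal_reading = matches_mk1996("*", "/index.html")
wildcard_reading = matches_gym2008("*", "/index.html")
```

Under the literal reading the rule * does not cover /index.html, while under the wildcard reading it covers every path.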
This problem is exacerbated by the fact that many new Webmasters adopt the GYM2008 syntax without realizing that it is relatively new and in conflict with the traditional syntax. As a result, if a bot that's been behaving perfectly well for 10+ years encounters a robots.txt like the one above, it may assume (correctly!) that it is permitted access to all files on the site although the Webmaster assumes just the opposite. The Webmaster will assume that the bot is ill-behaved and may go so far as to ban it.
This module defaults to GYM2008 syntax. This is unlikely to cause problems because the characters that GYM2008 reserves for special treatment (* and $) are unlikely to occur as path literals. In other words, a GYM2008-aware parser like this one is extremely unlikely to misinterpret a robots.txt written to MK1994/96 standards. Note that the reverse is not true – it's very likely that a parser unaware of GYM2008 will misinterpret the intent of a robots.txt that uses GYM2008-specific syntax.
In the spirit of "be generous in what you accept", this module also handles some things that are invalid according to the specs.
First, RobotExclusionRulesParser accepts "user-agent" or "useragent" in robots.txt whereas MK1994/96 only permit the former.
The second and most significant exception this module permits is the presence of non-ASCII characters in the significant fields of robots.txt. ("Significant" here means "anything outside of a comment".) MK1994 doesn't address the subject of non-ASCII or encodings, but MK1996 (in Section 3.3, "Formal Syntax") makes it clear that only ASCII characters are allowed in significant fields. (Side note – actually only a subset of printable ASCII characters are allowed; but you'll have to read the spec yourself to get the gory details.)
Despite what MK1996 says, a survey of real-world robots.txt files shows that about one in every thousand includes non-ASCII in significant fields. Python's robotparser module rolls over and dies when it encounters these. In contrast, this module attempts to decode the file using (a) the encoding specified in the HTTP Content-Type header sent with robots.txt (if present, which is rare) or (b) a default of ISO-8859-1 as per the HTTP spec RFC 2616. This solves the encoding problems for nearly all non-ASCII robots.txt files.
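The decoding strategy described above can be sketched as follows for Python 3. The function names are hypothetical and the code is an illustration of the approach, not an excerpt from the module.

```python
from email.message import Message

DEFAULT_ENCODING = "iso-8859-1"


def charset_from_content_type(content_type):
    """Pull the charset out of a Content-Type value, else use the default."""
    if not content_type:
        return DEFAULT_ENCODING
    message = Message()
    message["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is given.
    return message.get_content_charset() or DEFAULT_ENCODING


def decode_robots_txt(raw_bytes, content_type=None):
    """Decode the raw robots.txt bytes using the header's charset or the default."""
    return raw_bytes.decode(charset_from_content_type(content_type))


# 0xE9 is e-acute in ISO-8859-1; no Content-Type header was sent.
text = decode_robots_txt(b"Disallow: /caf\xe9")
```

Note that bytes.decode() can still raise LookupError for an unknown encoding name or UnicodeError for bytes that don't fit the declared encoding, so callers need to be ready for both.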
MK1996 contradicts MK1994 somewhat. MK1994 (which only defines disallows) says that a blank path indicates nothing is disallowed. MK1996 (which defines both allows and disallows) doesn't permit blank paths (the minimal path is a single slash: /) but doesn't mention anything about this change in the section on backwards compatibility. AFAIK blank paths are still widely used in Disallow lines which is consistent with the fact that most of the Net seems to ignore MK1996 and regard MK1994 as the de facto standard.
In the absence of other guidance, this code interprets blank disallow lines as meaning nothing is disallowed and for consistency interprets blank allow lines as meaning nothing is allowed. Thus, these two rules mean the same thing:
    # Disallow everything
    User-agent: foobot
    Disallow: /

    # Allow nothing
    User-agent: foobot
    Allow:
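One possible way to express that interpretation in code is to rewrite a blank rule as its explicit equivalent before matching. This Python 3 sketch is an assumption about how such normalization could be done, with a hypothetical helper name; it is not taken from the module.

```python
def normalize_blank_rule(field, path):
    """Return (field, path) with blank paths replaced by their meaning."""
    field = field.lower()
    path = path.strip()
    if path:
        # Non-blank rules pass through unchanged.
        return field, path
    if field == "disallow":
        # Blank Disallow: nothing is disallowed, so everything is allowed.
        return "allow", "/"
    # Blank Allow: nothing is allowed, so everything is disallowed.
    return "disallow", "/"


blank_allow = normalize_blank_rule("Allow", "")
blank_disallow = normalize_blank_rule("Disallow", "")
```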
Fixed the documentation (this file) about sitemap which was deprecated in version 1.4 in favor of sitemaps.
This code is copyright Philip Semanchuk under a 3-clause BSD license.
Thanks to Bastian Kleineidam for writing Python's robotparser module. Parts of this module were inspired (directly and indirectly) by his work.
Comments, bug reports, etc. are most welcome.