Nikita the Spider

Nikita News and Updates for 2006

December 2006

As of December 29th, Nikita has validated over 1,000,000 pages and checked nearly 3,000,000 links. It's been a good first seven months.

Feature-wise, this has been a quiet month for Nikita, but I did add an update to the robots.txt statistics Nikita collected in spring 2006.

As of December 1st, Nikita does a better job of enforcing W3C rules about which media types are valid when sending (X)HTML documents. Specifically, she'll warn about HTML doctypes sent as XHTML and vice versa. The only exception is that it's optional to have Nikita warn about XHTML 1.0 documents sent as text/html, and that option is off by default.
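The rule above can be sketched in a few lines of Python. This is only an illustration of the logic as described, not Nikita's actual code: the function name, the `warn_xhtml10_as_html` flag, and the use of doctype nicknames such as "XHTML 1.0 Strict" as input are all assumptions, and for brevity it treats `application/xhtml+xml` as the only XHTML media type.

```python
from typing import Optional

HTML_TYPES = {"text/html"}
XHTML_TYPES = {"application/xhtml+xml"}  # simplified; other XML types are ignored here


def media_type_warning(doctype_name: str, content_type: str,
                       warn_xhtml10_as_html: bool = False) -> Optional[str]:
    """Return a warning string, or None if the doctype/media type pairing is acceptable."""
    # Strip parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    name = doctype_name.upper()
    is_xhtml_doctype = name.startswith("XHTML")

    # HTML doctype sent with an XHTML media type.
    if not is_xhtml_doctype and media_type in XHTML_TYPES:
        return f"{doctype_name} document sent as {media_type}"

    # XHTML doctype sent as text/html.
    if is_xhtml_doctype and media_type in HTML_TYPES:
        # XHTML 1.0 as text/html only triggers a warning if the option is on.
        if name.startswith("XHTML 1.0") and not warn_xhtml10_as_html:
            return None
        return f"{doctype_name} document sent as {media_type}"

    return None
```

With the option left at its default, `media_type_warning("XHTML 1.0 Strict", "text/html")` returns `None`, while an XHTML 1.1 document sent as `text/html` produces a warning.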

November 2006

Nikita has validated over half a million pages and checked about 1.4 million links since her inception in late May of this year. She also recently finished spidering a site of over 135,000 pages, her largest to date.

The swell of traffic that Nikita received last month uncovered a few bugs and highlighted some rough spots. I've tackled as many of these as I can and I'm working on the rest as time permits. I also found time to add an article about protecting email addresses from spam.

As of November 20th, Nikita is smart enough not to follow URLs that contain session IDs. Also, code is now being clattered out on a brand spanking new buckling-spring keyboard, which replaces my fifteen-year-old IBM dreadnought (which still works fine but lacks some modern meta keys).
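For the curious, here is one way a spider might spot session IDs in a URL. It is a rough sketch, not a description of how Nikita does it: the list of parameter names is a hypothetical sample, and the function name is made up for the example.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical sample of session parameter names; not Nikita's actual list.
SESSION_PARAMS = {"phpsessid", "jsessionid", "sessionid", "session_id", "aspsessionid"}


def has_session_id(url: str) -> bool:
    """Return True if the URL appears to carry a session ID."""
    parts = urlsplit(url)

    # Query-string parameters, e.g. ?PHPSESSID=abc123
    for name, _value in parse_qsl(parts.query, keep_blank_values=True):
        if name.lower() in SESSION_PARAMS:
            return True

    # Path parameters, e.g. /page.jsp;jsessionid=abc123
    for segment in parts.path.split("/"):
        for param in segment.split(";")[1:]:
            if param.split("=", 1)[0].lower() in SESSION_PARAMS:
                return True

    return False
```

A spider would skip any link for which `has_session_id` returns `True`, which keeps it from treating the same page under many different session IDs as many different pages.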

October 2006

Before the tsunami of traffic resulting from the 456 Berea Street article, Nikita had been validating about 1,000 pages per day, on average, since her inception in late May of this year.

So far this month I've fixed several bugs in Nikita and made some updates.

First, in response to your requests, Nikita now displays each page's title in the page reports. I also simplified the reports' table of contents and moved some of the information that was there to other reports. The statistics report has some new numbers and should also be a little easier to read. Last but not least, I added print CSS (@media print) to the reports so they should print more nicely.

The most visible bug fixes are on the statistics page. Nikita's list of media types now adds up to 100% and the modes are calculated correctly.

September 2006

September 24th: Nikita has been validating about 820 pages per day, on average. In between running reports, I made a few updates to the service.

The default politeness delay is now seven seconds instead of eight.

Nikita now has smoother doctype handling. She reports nicknames for common doctypes (e.g. "HTML 4.01 Strict" instead of <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">). For the time being I'm not going to support custom doctypes, only doctypes that Nikita knows about. If a Web page uses a doctype with which Nikita is unfamiliar, she'll validate it against (X)HTML Transitional.
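The nickname lookup and the fallback can be pictured roughly like this. The public identifiers in the table are the standard W3C ones, but everything else is an assumption made for the sketch: the function name, the regular expression, and in particular the guess that an unfamiliar doctype mentioning XHTML falls back to XHTML 1.0 Transitional while anything else falls back to HTML 4.01 Transitional.

```python
import re
from typing import Tuple

# W3C public identifiers mapped to friendly nicknames.
DOCTYPE_NICKNAMES = {
    "-//W3C//DTD HTML 4.01//EN": "HTML 4.01 Strict",
    "-//W3C//DTD HTML 4.01 Transitional//EN": "HTML 4.01 Transitional",
    "-//W3C//DTD HTML 4.01 Frameset//EN": "HTML 4.01 Frameset",
    "-//W3C//DTD XHTML 1.0 Strict//EN": "XHTML 1.0 Strict",
    "-//W3C//DTD XHTML 1.0 Transitional//EN": "XHTML 1.0 Transitional",
    "-//W3C//DTD XHTML 1.0 Frameset//EN": "XHTML 1.0 Frameset",
    "-//W3C//DTD XHTML 1.1//EN": "XHTML 1.1",
}

# Pulls the public identifier out of a doctype declaration.
PUBLIC_ID_RE = re.compile(r'<!DOCTYPE\s+html\s+PUBLIC\s+"([^"]+)"', re.IGNORECASE)


def doctype_nickname(doctype: str) -> Tuple[str, bool]:
    """Return (nickname to validate against, whether the doctype was recognised)."""
    match = PUBLIC_ID_RE.search(doctype)
    public_id = match.group(1) if match else ""

    if public_id in DOCTYPE_NICKNAMES:
        return DOCTYPE_NICKNAMES[public_id], True

    # Unfamiliar or custom doctype: fall back to a Transitional DTD (assumed rule).
    if "XHTML" in public_id.upper():
        return "XHTML 1.0 Transitional", False
    return "HTML 4.01 Transitional", False
```

The second element of the result lets a report say both which DTD was used and whether that choice came from the page itself or from the fallback.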

Nikita is now better at handling large Web sites. She successfully spidered one site that was over 40,000 pages but had trouble on another of about the same size. I fixed the problem that caused the latter site's reports to fail.

August 2006

August 1st: Nikita has been validating about 500 pages per day, on average.

August 8th: The Hot Links report now respects the maximum report size that you specify on the start page, so sites with a lot of hot links don't get stuck with an enormous single-page report.
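Splitting a long report into pages is simple in principle. The sketch below assumes the maximum report size is a count of entries per page, which may not be how Nikita measures it; the function and parameter names are invented for the example.

```python
from typing import Iterator, List, Sequence


def paginate(entries: Sequence[str], max_per_page: int) -> Iterator[List[str]]:
    """Yield successive pages holding at most max_per_page entries each."""
    if max_per_page < 1:
        raise ValueError("max_per_page must be at least 1")
    for start in range(0, len(entries), max_per_page):
        yield list(entries[start:start + max_per_page])
```

Five hot links with a maximum of two per page would come out as three pages, the last one holding a single link.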

August 20th: It's now much easier to restrict Nikita to just a portion of your Web site. If you supply a path in the seed URL, Nikita will spider only the subtree contained in that path.
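To make the subtree idea concrete, here is a small sketch of how a spider could decide whether a link falls under the seed URL's path. It is an illustration under assumptions, not Nikita's implementation: the function names are made up, and it assumes that a seed path not ending in a slash names a document, so the subtree is that document's directory.

```python
import posixpath
from urllib.parse import urlsplit


def subtree_prefix(seed_url: str) -> str:
    """Return the directory path that bounds the crawl for this seed URL."""
    path = urlsplit(seed_url).path or "/"
    if not path.endswith("/"):
        # Assumed rule: /docs/index.html bounds the crawl to /docs/
        path = posixpath.dirname(path)
        if not path.endswith("/"):
            path += "/"
    return path


def in_subtree(seed_url: str, candidate_url: str) -> bool:
    """Return True if candidate_url is on the seed's host and under its path."""
    seed, cand = urlsplit(seed_url), urlsplit(candidate_url)
    if (seed.scheme, seed.netloc.lower()) != (cand.scheme, cand.netloc.lower()):
        return False
    return (cand.path or "/").startswith(subtree_prefix(seed_url))
```

Because the prefix always ends in a slash, a seed of `http://example.com/docs/` matches `/docs/api/index.html` but not `/docs-old/x.html`.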