Nikita the Spider Reports Help
Hot pages are pages on your site that probably need your attention.
Nikita considers a page hot if any of the following are true:
- The page is unreadable
- This happens either when Nikita is unfamiliar with
the encoding specified for the page (which happens rarely) or when the page contains octets
or octet combinations undefined in the specified encoding.
Browsers are very good at handling miscoded pages; Nikita is much less
forgiving. She makes no attempt to decode the page with any likely alternate
encoding nor does she try to ignore troublesome characters. This may be
frustrating but it is consistent with Nikita's goal of helping you to achieve
strict conformance to standards. If Nikita can't understand it, there's a good
chance some other user agents will fail on it too.
- The page specifies multiple encodings
- The encodings can duplicate one another (possibly bad) or conflict (definitely bad).
Duplicate encodings (for instance, UTF-8 specified in both the HTTP
Content-Type header and a META element in the document) are only a problem in
that they violate the DRY
(Don't Repeat Yourself) Principle – they're a first step towards conflicting
encodings.
Conflicting encodings (for instance, UTF-8 specified in the HTTP Content-Type
header and ISO-8859-1 specified in a META element in the document) are
a sign that the page author is confused and that some user agents might not be able to
read the document or render it properly.
You can read in
detail about how Nikita determines a
page's encoding.
- The URL exceeds 72 characters
- Some mail programs break URLs longer than
72 characters, which means your URL will be difficult for some people to send successfully
via email.
- The page contains validation errors
- The doctype is missing or unknown
- Warnings about unknown doctypes
often result from doctype declarations that are lower case where they should not be. (Important
parts of the declaration are case sensitive.) The W3C maintains
a list of commonly-used
doctypes.
- The page's media type and doctype conflict
- For instance, text/html is not an acceptable media type for XHTML 1.1 documents.
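Three of the checks above (unreadable pages, multiple encodings and long URLs) can be sketched in a few lines of Python. This is only an illustration of the rules as described here, not Nikita's actual code; the function names are hypothetical, and the sketch assumes the encodings from the HTTP header and the META element have already been extracted.

```python
import codecs

MAX_URL_LENGTH = 72  # the limit quoted in the list above


def is_readable(body: bytes, encoding: str) -> bool:
    """Decode strictly: no fallback encoding, no skipping of bad octets."""
    try:
        codecs.lookup(encoding)                 # unfamiliar encoding -> LookupError
        body.decode(encoding, errors="strict")  # undefined octets -> UnicodeDecodeError
    except (LookupError, UnicodeDecodeError):
        return False
    return True


def _canonical(name: str) -> str:
    """Map aliases such as 'utf8' and 'UTF-8' to one canonical name."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return name.strip().lower()


def compare_encodings(http_encoding, meta_encoding) -> str:
    """Return 'none', 'single', 'duplicate' or 'conflict'."""
    declared = [e for e in (http_encoding, meta_encoding) if e]
    if not declared:
        return "none"
    if len(declared) == 1:
        return "single"
    names = {_canonical(e) for e in declared}
    return "duplicate" if len(names) == 1 else "conflict"


def url_too_long(url: str) -> bool:
    """True when the URL exceeds the 72-character limit."""
    return len(url) > MAX_URL_LENGTH
```

For example, compare_encodings("UTF-8", "ISO-8859-1") reports a conflict, while compare_encodings("UTF-8", "utf8") reports a duplicate because both names refer to the same encoding.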
Hot headers are HTTP headers sent by the server that violate one of
the RFCs defining HTTP. Nikita doesn't check all HTTP response headers for validity, just
the ones that commonly contain mistakes. Nikita considers a header hot if any of the following
are true:
- A header contains an invalid date
- RFC 2616 §3.3
defines three acceptable HTTP header date formats for the headers that can contain
dates ('Date', 'Expires', 'Last-Modified' and 'Retry-After'). Here's an example of each
of the acceptable formats:
Sun, 06 Nov 1994 08:49:37 GMT ; RFC 822, updated by RFC 1123
Sunday, 06-Nov-94 08:49:37 GMT ; RFC 850, obsoleted by RFC 1036
Sun Nov 6 08:49:37 1994 ; ANSI C's asctime() format
Sending an absolute value like "0" in an Expires header is a common mistake. Although
the HTTP specification specifically mentions this practice, it also states clearly that
it is invalid
to send "0" in the Expires header.
Absolute values are permitted in the Retry-After header, and Nikita understands that
they're valid.
- A Location header value is not an absolute URL
- Location headers must refer to an absolute URL according
to RFC 2616 §14.30.
Sending a relative URL like /foo/bar.html is a common mistake.
- The response code is unknown
- Nikita will let you know if your server sends a response code that's not in
the HTTP Status Code Registry.
- A header contains invalid characters
- RFC 2616 §4.2
states that only printable ASCII (33 - 126 inclusive) plus whitespace (tab, CR, LF and
space) are valid in header values. Nikita considers a header hot if it contains
characters outside of this range. RFC 2047 §4 explains how to properly encode characters
that are out of that range.
- A response lacks a Date header
- RFC 2616 §14.18
states that a Date header must be included with all responses. There are a few response
codes (like those in the 5xx Server Error range) that are exceptions to this rule, and
Nikita takes into account those exceptions.
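The date, Location and character rules above lend themselves to a short Python sketch. It is an approximation under stated assumptions rather than Nikita's implementation: the function names are hypothetical, strptime() does not verify that the weekday matches the date, and it relies on an English (C) locale for day and month names.

```python
from datetime import datetime
from urllib.parse import urlsplit

# The three date formats accepted by RFC 2616 section 3.3.
DATE_FORMATS = (
    "%a, %d %b %Y %H:%M:%S GMT",  # RFC 822, updated by RFC 1123
    "%A, %d-%b-%y %H:%M:%S GMT",  # RFC 850, obsoleted by RFC 1036
    "%a %b %d %H:%M:%S %Y",       # ANSI C's asctime() format
)

# Printable ASCII (33 - 126) plus tab, LF, CR and space.
ALLOWED_CODES = set(range(33, 127)) | {9, 10, 13, 32}


def is_http_date(value: str) -> bool:
    """True if value matches one of the three accepted date formats."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(value.strip(), fmt)
            return True
        except ValueError:
            continue
    return False


def has_valid_date(name: str, value: str) -> bool:
    """Check a date-bearing header; Retry-After may also be a plain number."""
    v = value.strip()
    if name.lower() == "retry-after" and v.isascii() and v.isdigit():
        return True
    return is_http_date(v)


def is_absolute_url(value: str) -> bool:
    """True if a Location value has both a scheme and a host."""
    parts = urlsplit(value.strip())
    return bool(parts.scheme and parts.netloc)


def has_invalid_chars(value: str) -> bool:
    """True if any character falls outside the permitted range."""
    return any(ord(ch) not in ALLOWED_CODES for ch in value)
```

With these helpers, has_valid_date("Expires", "0") is false while has_valid_date("Retry-After", "120") is true, which mirrors the distinction drawn above.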
- URLs inspected
- This is a count of the unique URLs on your site which Nikita visited in an attempt to collect
information. It includes URLs that returned a response code of 200 as well as 30x redirects,
404s, etc.
It does not include URLs that were visited only during link
checking. Link checks don't record much besides the response code so Nikita can't
count them as "inspected". There are two reasons why Nikita might have checked a URL on your
site while excluding it from full inspection. The first is that your crawl may
have been limited to a certain number of pages. Once Nikita reaches that limit, she
doesn't inspect any more URLs but she will check the remaining URLs that she
knows about. The second is that you may have asked Nikita to check parameterized
URLs but not to visit
(inspect) them. The options to visit and check parameterized URLs are both off by default.
- Resources found
- This is a count of the unique URLs on your site that Nikita
visited and found present (i.e. returned an HTTP status code of 200). It is a subset
of URLs inspected (above).
- Pages found
- This is a count of all the (X)HTML pages on your site. More specifically, it is a count
of the unique URLs on your site that Nikita visited and which returned (a) an HTTP
status code of 200 and (b) an (X)HTML media type. It is a subset of resources found (above).
- Page size -- mean, median and mode
- These statistics are calculated based on the sizes (byte count) of pages delivered
to Nikita. The mean is an arithmetic mean expressed in bytes rounded
off to the nearest integer. The list of mode page sizes can contain duplicate values due
to rounding of the displayed values. In other words, the actual mode values differ but
only by values of less than .01 KiB.
- Hot Pages
- Listed here are the total number of hot pages and the count of pages that are hot
broken down by reason. You can read in detail about what Nikita
considers problematic enough for inclusion in the hot pages list. Note that the
sum of the percentages can exceed 100% because each page can have multiple
reasons for being on the hot list.
The last four hot page statistics contain percentages based on the
total number of readable pages. They refer to problems with the document
content and it doesn't make sense for Nikita to report content errors for pages she
can't read. If there are no unreadable pages on your site, you can ignore this
statistical detail.
- Hot Headers
- Listed here is the total number of URLs on your site that responded with HTTP headers
that violate standards (usually RFC 2616, the HTTP 1.1 specification).
You can read in detail about what Nikita
considers problematic enough for inclusion in the hot headers list.
Note that the
percentages may add up to greater than 100% because each URL can have multiple
reasons for being on the hot list.
- Links
- A link is any reference to another resource. For instance, the href
attribute of an <a> element counts as a link, as does the src
attribute of an <img> or <frame> element. Nikita breaks down the links she
sees by scheme (the part of a URI that comes before the ":" or "://"). Note that
javascript: links are not references to JavaScript files but are <a> elements coded like
so: <a href='javascript:alert("boo!");'>. Links in the "Other" category
are often mistakes, so watch out for these.
Nikita further divides HTTP, HTTPS and FTP links into internal and outbound based on
the destination of the link. Destinations in the same domain as the seed URL are
internal; all other destinations make a link outbound. Note that if the seed URL is
www.example.com, a URL with a domain like news.example.com or
ftp.example.com is considered outbound.
- Media Types
- This is a list of the media types reported to Nikita in the HTTP Content-Type header
of the URLs she visited. Nikita calculates the percentages shown based on the total
number of resources that she found.
URLs that don't supply a media type are assigned the media type
application/octet-stream, as per Section 7.2.1 of HTTP 1.1.
- Encodings
- This is a list of the primary encodings supplied for each page. If a
page specifies multiple encodings, only one is listed here. Percentages are calculated
based on the total number of pages. You can read in
detail about how Nikita determines a
page's encoding.
- Doctypes
- This is a list of the doctypes supplied for each page. Percentages are calculated
based on the total number of readable pages.
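Two of the statistics above, the page size summary and the internal/outbound split of links, can be illustrated with a brief Python sketch. It is a rough reading of the descriptions in this list, not Nikita's own code: the function names are hypothetical, page sizes are assumed to be byte counts, and links are assumed to have been resolved to absolute URLs already.

```python
import statistics
from urllib.parse import urlsplit


def page_size_stats(sizes):
    """Summarise page sizes (in bytes) as mean, median and mode(s)."""
    return {
        "mean": round(statistics.mean(sizes)),    # nearest whole byte
        "median": statistics.median(sizes),
        "modes": statistics.multimode(sizes),     # every most-common size
    }


def classify_link(link: str, seed_url: str):
    """Return (scheme, direction); direction is None for non-HTTP/HTTPS/FTP links."""
    parts = urlsplit(link)
    scheme = parts.scheme.lower()
    if scheme not in ("http", "https", "ftp"):
        return scheme, None
    # Only an exact host match with the seed URL counts as internal.
    seed_host = urlsplit(seed_url).hostname
    direction = "internal" if parts.hostname == seed_host else "outbound"
    return scheme, direction
```

Given a seed URL of http://www.example.com/, classify_link() treats http://news.example.com/ as outbound, which matches the subdomain behaviour described under Links.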