This page explains in more detail how Linklint performs site checks.
Creating Seeds for a Site Check |
A linkset (entered on the command line) specifies a set of links to check. For each linkset, a seed is created as the starting point for Linklint's search of your site. If the linkset contains no wildcard characters (@ and #), it must be a single link, and the complete linkset becomes the seed file. If the linkset contains wildcard characters, the seed is the longest string of non-wildcard characters starting with the leading "/" and ending with the last "/" before a wildcard. For example, if you specify /@ to check your entire site, Linklint will start with one seed file, "/", which is the default file for your root directory (sometimes called your home page).
Linklint does not have (or need) a -seed option. A linkset without wildcard characters is the same thing as a seed file. In fact, if you have a list of specific HTML pages to check, just put the paths (one per line) in a file and tell Linklint that this is a command file (a single leading @ sign before the filename). Make sure that you list only the paths (no http:// and no hostname); otherwise Linklint will do a remote URL check on your pages (it will see whether the pages exist, but it won't check the links on them).
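The seed rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated on this page, not Linklint's own code; the function name seed_for_linkset is made up for the example.

```python
def seed_for_linkset(linkset: str) -> str:
    """Return the seed for a linkset, following the rule described above."""
    # Positions of the wildcard characters @ and #.
    wildcards = [i for i, ch in enumerate(linkset) if ch in "@#"]
    if not wildcards:
        # No wildcards: the linkset is a single link and is itself the seed.
        return linkset
    first = wildcards[0]
    # Keep everything up to and including the last "/" before the first wildcard.
    slash = linkset.rfind("/", 0, first)
    return linkset[: slash + 1]


print(seed_for_linkset("/@"))                # "/"
print(seed_for_linkset("/docs/@.html"))      # "/docs/"
print(seed_for_linkset("/docs/index.html"))  # "/docs/index.html"
```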
Site Check Recursion |
Linklint tries to find all of the pages and files in a site using recursion. Each seed is checked, and if it is an HTML file it is parsed, creating a new list of files to check. These files are checked in turn, creating new lists of files to check, and so on. This process continues until one of the mechanisms to stop recursion kicks in.
The primary method used to stop recursion is to only check local links. A link is considered local if either:
- it resolves to a file reference without a scheme or host (e.g. /something), or
- it resolves to http://hostname/. . . and -host hostname was specified.
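The test for a local link can be sketched as follows. This is an illustrative Python version of the two conditions above, not Linklint's implementation; the function name is_local and the example hostnames are assumptions made for the example.

```python
from typing import Optional
from urllib.parse import urlsplit


def is_local(link: str, host: Optional[str] = None) -> bool:
    """Apply the two 'local link' conditions to an already resolved link."""
    parts = urlsplit(link)
    # Condition 1: a plain file reference with no scheme and no host.
    if not parts.scheme and not parts.netloc:
        return True
    # Condition 2: http://hostname/... where hostname was given with -host.
    return (
        host is not None
        and parts.scheme == "http"
        and parts.hostname == host.lower()
    )


print(is_local("/something"))                                    # True
print(is_local("http://www.example.com/a.html"))                 # False
print(is_local("http://www.example.com/a.html", "www.example.com"))  # True
```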
The second method for halting recursion is the use of specific linksets. Only HTML pages that match one or more of the linksets you specify will be checked for more links. HTML pages which don't match any of the linksets will be skipped, which means they are checked to see if they exist but none of the links inside the file are added to the list of files to check. You can also specifically -skip sets of HTML files or -limit the total number of HTML files checked.
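The recursion and the role of linksets can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not Linklint's code: the site is a dictionary mapping each local path to the links found on that page, and the wildcard meanings (@ matches any characters, # matches any characters except "/") are assumed here rather than taken from this page. The names linkset_to_regex and crawl are hypothetical.

```python
import re
from collections import deque


def linkset_to_regex(linkset):
    # Assumption: "@" matches any characters, "#" matches anything except "/".
    pieces = []
    for ch in linkset:
        if ch == "@":
            pieces.append(".*")
        elif ch == "#":
            pieces.append("[^/]*")
        else:
            pieces.append(re.escape(ch))
    return re.compile("".join(pieces) + r"\Z")


def crawl(site, seeds, linksets):
    """site maps a local path to the list of links found on that page."""
    patterns = [linkset_to_regex(s) for s in linksets]
    queue = deque(seeds)
    seen = set(seeds)
    parsed, skipped, missing = [], [], []
    while queue:
        path = queue.popleft()
        if path not in site:
            missing.append(path)          # the file does not exist
            continue
        if not any(p.match(path) for p in patterns):
            skipped.append(path)          # exists, but its links are not followed
            continue
        parsed.append(path)
        for link in site[path]:
            if link not in seen:          # each link is checked only once
                seen.add(link)
                queue.append(link)
    return parsed, skipped, missing


site = {
    "/": ["/a.html", "/docs/b.html", "/gone.html"],
    "/a.html": ["/"],
    "/docs/b.html": ["/docs/c.html"],
    "/docs/c.html": [],
}
print(crawl(site, ["/"], ["/#"]))
```

With the linkset /# only pages directly under the root are parsed, so /docs/b.html is checked for existence but /docs/c.html is never reached.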
Parsing HTML Files |
These are the rules Linklint uses to extract links from HTML files.
Any tags enclosed inside comment tags: <!-- . . . --> or script tags: <script> . . . </script> are ignored.
The <base href=URL> tag will cause Linklint to set the base scheme, host, path, and file to the appropriate parts of URL for the remainder of the file. I've tried to emulate the behavior of the Netscape Navigator 3.0 browser. In general, missing elements from the front part of a URL are filled in from the base specification.
Links are extracted from the following tags:
- <a href=LINK name=NAME>
- <applet code=LINK codebase=BASE>
- <area href=LINK>
- <bgsound src=LINK>
- <body background=LINK>
- <embed src=LINK>
- <form action=LINK>
- <frame src=LINK>
- <img src=LINK lowsrc=LINK dynsrc=LINK usemap=NAME>
- <input src=LINK>
- <map name=NAME>
- <meta http-equiv=refresh content="... href=LINK">
- <script src=LINK>
Tag and attribute names are case insensitive. A LINK can be bare or enclosed in single or double quotes. The characters < and > are allowed inside of a tag only if they are enclosed in single or double quotes. Arbitrary whitespace is allowed around the = sign and between a tag's name and its attributes.
Tags and/or attributes that do not match any of the above criteria are ignored.
All the links found on an HTML page are checked. Non-HTML links are checked only for existence. If a link is to an HTML file, it will also get parsed subject to the rules of recursion.
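For readers who want to experiment with these rules, here is a rough Python approximation built on the standard html.parser module. It is not Linklint's parser and it is deliberately incomplete: it leaves out <applet> (whose code attribute depends on codebase), <base>, <meta http-equiv=refresh>, and the NAME attributes. The names LINK_ATTRS, LinkExtractor and extract_links are made up for the example.

```python
from html.parser import HTMLParser

# Tag name -> attributes that carry a LINK (a subset of the list above).
LINK_ATTRS = {
    "a": ("href",),
    "area": ("href",),
    "bgsound": ("src",),
    "body": ("background",),
    "embed": ("src",),
    "form": ("action",),
    "frame": ("src",),
    "img": ("src", "lowsrc", "dynsrc"),
    "input": ("src",),
    "script": ("src",),
}


class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, which gives the
        # case-insensitive matching described above. Comments and the
        # contents of <script> elements never reach this method.
        wanted = LINK_ATTRS.get(tag, ())
        for name, value in attrs:
            if name in wanted and value:
                self.links.append(value)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


page = (
    '<!-- <a href="hidden.html"> -->'
    '<A HREF="one.html">one</A>'
    "<img src=pic.gif lowsrc='low.gif'>"
    '<script>document.write("<a href=no.html>")</script>'
)
print(extract_links(page))
```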
Resolving Links |
In order to be able to follow links properly and to ensure that links get checked only once, all links are made absolute before they are checked. I have tried to use the same rules as a browser for making links absolute. You can use the -db3 flag to see how links get resolved. This flag causes every tag from an HTML file that contains a link to get printed out in the log file followed by the fully expanded link.
If a -host is specified, links starting with "http://host" have this text removed, creating a local link. Thus all local links will start with "/" followed by a full path from the server root to the file to be checked.
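The two steps above (make the link absolute, then strip "http://host" when -host is given) can be approximated with Python's urllib.parse. This is a sketch of the idea rather than Linklint's exact rules; the function name resolve and the example hostnames are assumptions made for the example.

```python
from typing import Optional
from urllib.parse import urljoin, urlsplit


def resolve(base: str, link: str, host: Optional[str] = None) -> str:
    """Make link absolute relative to base, then localize it if it is on host."""
    absolute = urljoin(base, link)
    parts = urlsplit(absolute)
    if host and parts.scheme == "http" and parts.hostname == host.lower():
        # Remove the leading "http://host" text, leaving a local link.
        prefix_len = len(parts.scheme) + len("://") + len(parts.netloc)
        return absolute[prefix_len:] or "/"
    return absolute


print(resolve("/docs/guide/index.html", "../img/logo.gif"))
# /docs/img/logo.gif
print(resolve("/docs/index.html", "http://www.example.com/a/b.html",
              host="www.example.com"))
# /a/b.html
```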
Default Index Files |
HTTP servers treat a link to a directory followed by a "/" as a request for a default file. The server will look for a (server-specific) default file in the directory and serve that up if it exists. Otherwise the server will generate a listing of all of the files and subdirectories in the directory.
Linklint emulates this behavior in local site checks by searching for its own list of default files: home.html, index.html, index.shtml, index.htm, index.cgi, wwwhome.html, and welcome.html. If none of these are found, all the files and subdirectories in the directory are checked. You can change the set of default files Linklint looks for with the -index filename option, which replaces the built-in set with the file(s) you specify. On the command line each default file must be preceded by the -index flag. If all of the default files are in lowercase, the search is case insensitive. If any of the files has an uppercase letter, the search is case sensitive.
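The default-file search and its case rule can be sketched like this. It is an illustration only, not Linklint's code: the function name find_default_file is made up, and treating the order of the list as a priority order is an assumption.

```python
import os

DEFAULT_INDEX = ["home.html", "index.html", "index.shtml", "index.htm",
                 "index.cgi", "wwwhome.html", "welcome.html"]


def find_default_file(directory, index_files=DEFAULT_INDEX):
    """Return the name of the default file in directory, or None."""
    entries = os.listdir(directory)
    # The search is case sensitive only if some index name has an uppercase letter.
    case_sensitive = any(name != name.lower() for name in index_files)
    for wanted in index_files:  # assumption: earlier names take priority
        for entry in entries:
            if case_sensitive:
                same = entry == wanted
            else:
                same = entry.lower() == wanted
            if same and os.path.isfile(os.path.join(directory, entry)):
                return entry
    # No default file: the caller would check every entry in the directory.
    return None
```

For example, a directory containing only INDEX.HTML matches the built-in (all lowercase) list, but does not match a custom list containing Index.html.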
Server-side Image Maps |
Linklint can check all links that are used in both client-side and server-side image maps. Client-side image maps are handled automatically since Linklint parses the <area href=LINK> tag in HTML files.
Server-side image maps are a little bit tricky. Some servers have the image map CGI software built in, so links ending in .map are treated as map files and automatically sent to the image map program for processing. Linklint mimics this behavior. Any link ending in .map is parsed as if it were a map file. In addition, all .map links are checked locally even if the -http flag is used, since map files are generally not accessible directly via HTTP.
Some servers require server-side image map links to contain the path of the CGI image map program followed by the path to the map file, as in:
<a href=/cgi-bin/imagemap/dir/info.map>.
Here /cgi-bin/imagemap is the location of the image map CGI program and /dir/info.map is the location of the map file. Linklint can resolve these links and read the map file (locally only, even if -http is used). However, you must provide the path from your server root directory to your image map program using the -map option. Three common image map specifications are:
- -map /cgi-bin/imagemap
- -map /cgi-bin/imagemap.exe
- -map cgi-bin/htimage
For example, if you set "-map /cgi-bin/imagemap", the link /cgi-bin/imagemap/dir/info.map will be transformed to /dir/info.map, which will be read in locally and parsed as a map file. You need to be sure to set -root properly for Linklint to be able to find the map file.
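The transformation applied to image map links can be sketched as a simple prefix strip. This is an illustration, not Linklint's code: the function name map_file_path is hypothetical, and normalizing a -map value that lacks a leading "/" is an assumption made for the example.

```python
def map_file_path(link, map_prefixes):
    """Return the local path of the map file for link, or None if not a map link."""
    for prefix in map_prefixes:
        # Assumption: treat "cgi-bin/htimage" the same as "/cgi-bin/htimage".
        prefix = "/" + prefix.strip("/")
        if link.startswith(prefix + "/"):
            # Drop the CGI program path; what remains is the map file path.
            return link[len(prefix):]
    if link.endswith(".map"):
        # Built-in style: any link ending in .map is itself the map file.
        return link
    return None


print(map_file_path("/cgi-bin/imagemap/dir/info.map", ["/cgi-bin/imagemap"]))
# /dir/info.map
```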
How the Status Cache Works |
Linklint uses a combination of three different methods to keep track of remote URL modification times:
- Last-Modified date: Many web servers let Linklint know the last date a file was modified. If this date is available for a page, Linklint uses it to keep track of changes.
- If-Modified-Since requests: If the Last-Modified date is not available, Linklint tries an If-Modified-Since request. Linklint asks whether the page has been modified since the last time (according to Linklint) it was checked.
- Checksum of the remote file: If neither method above is available on a remote server, Linklint reads in the entire remote file, makes a checksum of its contents, and uses this checksum to keep track of changes.
These methods are totally transparent to the Linklint user (you). For each URL the most efficient method is tried first, and the checksum is only used as a last resort.
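The order in which the three methods are tried can be sketched as a small decision function. This is only a model of the fallback order described above, not Linklint's cache: the function name has_changed, the dictionary keys, and the use of SHA-256 as the checksum are all assumptions made for the example, and no network requests are made. A URL that has never been seen before is reported as changed.

```python
import hashlib
from typing import Optional


def has_changed(cache: dict,
                last_modified: Optional[str] = None,
                ims_status: Optional[int] = None,
                body: Optional[bytes] = None) -> bool:
    """Decide whether a remote URL changed, updating its cache entry."""
    # 1. Cheapest: compare the Last-Modified date reported by the server.
    if last_modified is not None:
        changed = cache.get("last_modified") != last_modified
        cache["last_modified"] = last_modified
        return changed
    # 2. Next: the server's answer to an If-Modified-Since request.
    #    HTTP status 304 means "Not Modified".
    if ims_status is not None:
        return ims_status != 304
    # 3. Last resort: checksum of the whole file (SHA-256 is a stand-in).
    digest = hashlib.sha256(body or b"").hexdigest()
    changed = cache.get("checksum") != digest
    cache["checksum"] = digest
    return changed


entry = {}
print(has_changed(entry, body=b"<html>v1</html>"))  # True (first time seen)
print(has_changed(entry, body=b"<html>v1</html>"))  # False
print(has_changed(entry, body=b"<html>v2</html>"))  # True
```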
© Copyright 1997 - 2001 James B. Bowlin |