Checkbot -- a WWW link verifier

Checkbot is a Perl 5 script which can verify links within a region of the World Wide Web. It checks all pages within an identified region, and all links within that region. After checking all links within the region, it will also check all links which point outside of the region, and then stop. Checkbot regularly writes reports on its findings, including all servers found in the region and all links with problems on those servers.

Checkbot was originally written to check a number of servers at once. This has influenced some design decisions, so you might want to keep that in mind when making suggestions. Speaking of which, be sure to check the to-do file on the website for things which have already been suggested for Checkbot.

INSTALLATION

Making and installing Checkbot is easy:

    perl Makefile.PL
    make
    make install

You will need to have the following Perl modules installed in order to properly install Checkbot:

    LWP
    URI
    HTML::Parser
    MIME::Base64
    Net::FTP
    Mail::Send      (optional, contained in the MailTools package)
    Time::Duration  (optional, used for additional info in the report)
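Once installed, Checkbot is run from the command line. As a minimal, illustrative sketch only (the host name is made up; the --match and --sleep options are described in the changelog below, and the manual page or checkbot --help lists the full set of options and how start URLs are passed):

    # Check all pages whose URL matches www.example.com, starting from
    # the given URL and pausing 2 seconds between requests.
    checkbot --match 'www\.example\.com' --sleep 2 http://www.example.com/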
WHERE TO FIND IT

Checkbot is distributed at:

    http://degraaff.org/checkbot/

Problems, bug reports, and feature requests are welcome at:

    http://sourceforge.net/projects/checkbot/

There is an announcement mailing list to which announcements of new versions are posted. You can sign up for the list at:

    https://lists.sourceforge.net/lists/listinfo/checkbot-announce

Hans de Graaff

RECENT CHANGES

Changes in version 1.80 (15-Oct-2008)

* Fix handling of the nofollow robots tag.
* Require a newer version of LWP for better handling of character encodings.
* Ignore the mms scheme.
* Minor clarification in output.

Changes in version 1.79 (3-Feb-2007)

* Correctly parse documents to avoid problems with UTF-8 documents. This avoids the "Parsing of undecoded UTF-8 will give garbage when decoding entities" messages.
* Allow regular expressions in the suppression file, and complain if the suppression file is not a proper file.
* More robust handling of HTTP and FTP servers that have problems responding to HEAD requests.
* Use the original URL to report problems.
* Ensure XHTML compliance.

Changes in version 1.78 (3-May-2006)

* Don't throw errors for links that cannot be expected to be valid all the time (e.g. the classid attribute of an object element).
* Better fallbacks for some cases where the HEAD request does not work.
* Add more classes and ids to allow more styling of results pages (including an example CSS file).
* Ensure XHTML compliance.
* Better checks for optional dependencies.

Changes in version 1.77 (28-Jul-2005)

* Fix a silly build-related problem that prevented Checkbot 1.76 from running at all.
* Check for the presence of the robots meta tag and act on it.

Changes in version 1.76 (25-Jul-2005)

* Error reports now include the page title for easier identification.
* javascript: links are now ignored because they cannot be checked.
* Documentation updates.

Changes in version 1.75 (22-Apr-2004)

* New --cookies option to accept cookies from servers while checking.
* New --noproxy option indicates which domains should not be passed through the proxy.
* New error code for unknown domains; only known non-checkable schemes are ignored now.
* Minor bug fixes.
* Documentation updates.

Changes in version 1.74 (17-Dec-2003)

* New --suppress option allows response code/URL combinations not to be reported as problems (see the sketch of a suppression file after this changelog).
* Checkbot warnings are now handled as pseudo-HTTP status messages so that they can make use of all Checkbot features such as --dontwarn.
* The --allow-simple-hosts option is deprecated due to this change.
* More robust handling of (lack of) status messages.
* Checkbot now requires LWP 5.70 due to bugfixes in that release, although it should still work with older LWP versions.
* Documentation fixes.

Changes in version 1.73 (31-Aug-2003)

* Checkbot now tries to produce valid XHTML 1.1.
* URLs matching the --ignore option are now completely ignored; they used to be checked but not reported.
* Proxy support works again, but --proxy now applies to all links.
* Documentation fixes.

Changes in version 1.72 (04-May-2003)

* URLs with query strings are now checked by default; the --exclude option can be used to revert to the previous behavior.
* The server results page contains shortcut links to each section.
* Removed the warning about unqualified hostnames for news: URLs.
* Handling of signals such as SIGINT.
* Bug and documentation fixes.

Changes in version 1.71 (29-Dec-2002)

* New --filter option allows rewriting of URLs before they are checked.
* Problematic links are now reported for each page on which they occur.
* New statistics which should work correctly.
* Much simplified storage of information on problem links.
* Duplicate links are now properly detected and not checked twice.
* Rewritten internals for link checking; as a consequence, internal and external links are now checked at the same time, not in two passes as before.
* Rewritten internals for message output.
* A simple test case for 'make test'.
* Minor cleanups of the code.

Version 1.70 was only released for testing purposes.

Changes in version 1.69

* Improved makefile and packaging.
* Better default for the --match argument.
* Additional instance of using GET instead of HEAD added.
* Bug fixes in the printing of web server feedback.

Changes in version 1.68

* Add --allow-simple-hosts, which disables the check for unqualified hosts.
* Mention the --style option in the help text, and add an example style file.
* Change the --sleep implementation so that fractional seconds can be used.
* Fix a bug with handling tags.
* Tighten checks for the http and https schemes.
* Remove harmless warnings.
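As promised above, here is a sketch of what a suppression file for the --suppress option might look like. This is an assumption based on the changelog entries for 1.74 and 1.79, not a definitive format; consult the Checkbot manual page for the exact syntax. The response codes and URLs shown are purely illustrative:

    # Each line pairs a response code with a URL not to be reported as
    # a problem; since 1.79 the URL may also be a regular expression.
    404 http://www.example.com/obsolete/page.html
    403 http://www\.example\.com/cgi-bin/.*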