[web-crawler] Get a list of URLs from a site

I'm deploying a replacement site for a client but they don't want all their old pages to end in 404s. Keeping the old URL structure wasn't possible because it was hideous.

So I'm writing a 404 handler that should look for an old page being requested and do a permanent redirect to the new page. Problem is, I need a list of all the old page URLs.

I could do this manually, but I'd be interested if there are any apps that would provide me a list of relative (eg: /page/path, not http:/.../page/path) URLs just given the home page. Like a spider but one that doesn't care about the content other than to find deeper pages.

This question is related to web-crawler

The answer is


wget from a linux box might also be a good option as there are switches to spider and change it's output.

EDIT: wget is also available on Windows: http://gnuwin32.sourceforge.net/packages/wget.htm


do wget -r -l0 www.oldsite.com

Then just find www.oldsite.com would reveal all urls, I believe.

Alternatively, just serve that custom not-found page on every 404 request! I.e. if someone used the wrong link, he would get the page telling that page wasn't found, and making some hints about site's content.


The best on I have found is http://www.auditmypc.com/xml-sitemap.asp which uses Java, and has no limit on pages, and even lets you export results as a raw URL list.

It also uses sessions, so if you are using a CMS, make sure you are logged out before you run the crawl.


Here is a list of sitemap generators (from which obviously you can get the list of URLs from a site): http://code.google.com/p/sitemap-generators/wiki/SitemapGenerators

Web Sitemap Generators

The following are links to tools that generate or maintain files in the XML Sitemaps format, an open standard defined on sitemaps.org and supported by the search engines such as Ask, Google, Microsoft Live Search and Yahoo!. Sitemap files generally contain a collection of URLs on a website along with some meta-data for these URLs. The following tools generally generate "web-type" XML Sitemap and URL-list files (some may also support other formats).

Please Note: Google has not tested or verified the features or security of the third party software listed on this site. Please direct any questions regarding the software to the software's author. We hope you enjoy these tools!

Server-side Programs

  • Enarion phpSitemapsNG (PHP)
  • Google Sitemap Generator (Linux/Windows, 32/64bit, open-source)
  • Outil en PHP (French, PHP)
  • Perl Sitemap Generator (Perl)
  • Python Sitemap Generator (Python)
  • Simple Sitemaps (PHP)
  • SiteMap XML Dynamic Sitemap Generator (PHP) $
  • Sitemap generator for OS/2 (REXX-script)
  • XML Sitemap Generator (PHP) $

CMS and Other Plugins:

  • ASP.NET - Sitemaps.Net
  • DotClear (Spanish)
  • DotClear (2)
  • Drupal
  • ECommerce Templates (PHP) $
  • Ecommerce Templates (PHP or ASP) $
  • LifeType
  • MediaWiki Sitemap generator
  • mnoGoSearch
  • OS Commerce
  • phpWebSite
  • Plone
  • RapidWeaver
  • Textpattern
  • vBulletin
  • Wikka Wiki (PHP)
  • WordPress

Downloadable Tools

  • GSiteCrawler (Windows)
  • GWebCrawler & Sitemap Creator (Windows)
  • G-Mapper (Windows)
  • Inspyder Sitemap Creator (Windows) $
  • IntelliMapper (Windows) $
  • Microsys A1 Sitemap Generator (Windows) $
  • Rage Google Sitemap Automator $ (OS-X)
  • Screaming Frog SEO Spider and Sitemap generator (Windows/Mac) $
  • Site Map Pro (Windows) $
  • Sitemap Writer (Windows) $
  • Sitemap Generator by DevIntelligence (Windows)
  • Sorrowmans Sitemap Tools (Windows)
  • TheSiteMapper (Windows) $
  • Vigos Gsitemap (Windows)
  • Visual SEO Studio (Windows)
  • WebDesignPros Sitemap Generator (Java Webstart Application)
  • Weblight (Windows/Mac) $
  • WonderWebWare Sitemap Generator (Windows)

Online Generators/Services

  • AuditMyPc.com Sitemap Generator
  • AutoMapIt
  • Autositemap $
  • Enarion phpSitemapsNG
  • Free Sitemap Generator
  • Neuroticweb.com Sitemap Generator
  • ROR Sitemap Generator
  • ScriptSocket Sitemap Generator
  • SeoUtility Sitemap Generator (Italian)
  • SitemapDoc
  • Sitemapspal
  • SitemapSubmit
  • Smart-IT-Consulting Google Sitemaps XML Validator
  • XML Sitemap Generator
  • XML-Sitemaps Generator

CMS with integrated Sitemap generators

  • Concrete5

Google News Sitemap Generators The following plugins allow publishers to update Google News Sitemap files, a variant of the sitemaps.org protocol that we describe in our Help Center. In addition to the normal properties of Sitemap files, Google News Sitemaps allow publishers to describe the types of content they publish, along with specifying levels of access for individual articles. More information about Google News can be found in our Help Center and Help Forums.

  • WordPress Google News plugin

Code Snippets / Libraries

  • ASP script
  • Emacs Lisp script
  • Java library
  • Perl script
  • PHP class
  • PHP generator script

If you believe that a tool should be added or removed for a legitimate reason, please leave a comment in the Webmaster Help Forum.


So, in an ideal world you'd have a spec for all pages in your site. You would also have a test infrastructure that could hit all your pages to test them.

You're presumably not in an ideal world. Why not do this...?

  1. Create a mapping between the well known old URLs and the new ones. Redirect when you see an old URL. I'd possibly consider presenting a "this page has moved, it's new url is XXX, you'll be redirected shortly".

  2. If you have no mapping, present a "sorry - this page has moved. Here's a link to the home page" message and redirect them if you like.

  3. Log all redirects - especially the ones with no mapping. Over time, add mappings for pages that are important.


I would look into any number of online sitemap generation tools. Personally, I've used this one (java based)in the past, but if you do a google search for "sitemap builder" I'm sure you'll find lots of different options.


Write a spider which reads in every html from disk and outputs every "href" attribute of an "a" element (can be done with a parser). Keep in mind which links belong to a certain page (this is common task for a MultiMap datastructre). After this you can produce a mapping file which acts as the input for the 404 handler.