There are tons of XML Sitemap Generators out there on the web already, so why would anyone need another one? Well, one of the points of sitemaps is to give search engines a helping hand when they’re crawling and indexing your website.
Stop and think for a second about how these sitemap generator sites actually build your sitemap: they crawl your site and produce a nice-looking sitemap file. Okay, but how the heck are these sites going to give any extra information to major search engines like Google, whose own crawlers are among the most advanced around?
The answer is they’re not, which is where the Perl script below comes in. It’s an extremely basic building block that simply reads a plain-text list of URLs, one per line, from “urls.txt” and writes out an XML Sitemap file (“sitemap.xml”) containing those URLs.
So how do you get the list of URLs, and what advantage does it hold over a third-party generator? You can get the URLs by using your own spider or an off-the-shelf solution like Sphider. The benefit is that you can specify crawling rules that exclude pages you don’t want indexed or that have no value for search engines.
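If your spider can’t express exclusion rules itself, the same effect can be had by filtering its output before it ends up in “urls.txt”. The following is only a rough sketch of that pre-processing step, separate from the generator script: the patterns, the example URLs and the `keep_url` helper are all made up for illustration, so swap in whatever rules suit your own site.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical exclusion rules: adjust these patterns to suit your own site.
my @exclude = (
    qr{/admin/},              # back-office pages
    qr/\?.*sessionid=/i,      # URLs carrying a session ID
    qr/\.pdf$/i,              # documents you would rather leave out
);

# Return 1 if the URL matches none of the exclusion patterns, 0 otherwise.
sub keep_url {
    my ($url) = @_;
    for my $pattern (@exclude) {
        return 0 if $url =~ $pattern;
    }
    return 1;
}

# Example input; in practice this would be the raw output of your spider.
my @crawled = (
    'http://www.example.com/',
    'http://www.example.com/admin/login',
    'http://www.example.com/about?sessionid=abc123',
    'http://www.example.com/brochure.pdf',
    'http://www.example.com/contact',
);

# Keep only the URLs that pass every rule and print them one per line,
# ready to be redirected into urls.txt.
my @kept = grep { keep_url($_) } @crawled;
print "$_\n" for @kept;
```

With the example list, only the home page and the contact page survive the filter.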
#!/usr/bin/perl -w
use strict;

open(URLS, "urls.txt") or die "Error: $!";
open(MAP, ">sitemap.xml") or die "Error: $!";

print MAP '<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">';

while (<URLS>) {
    my $url = $_;
    chomp($url);
    print MAP "\n <url>\n <loc>$url</loc>\n </url>";
}

print MAP "\n</urlset>";
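One thing to watch out for: the script writes each URL into the file exactly as it appears in “urls.txt”, while the Sitemap protocol asks for the characters &, <, >, " and ' to be entity-escaped. Here is a small sketch of how that could be handled. The `escape_xml` helper and the example URL are my own illustration, and it assumes the list holds raw URLs (running it over already-escaped ones would escape them twice).

```perl
use strict;
use warnings;

# Hypothetical helper: replace the five characters the Sitemap protocol
# requires to be entity-escaped. A single pass over the string means the
# ampersands introduced by the replacements are not escaped again.
sub escape_xml {
    my ($text) = @_;
    my %entity = (
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
        "'" => '&apos;',
    );
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

# Example URL with a query string; the ampersand has to be escaped.
my $url   = 'http://www.example.com/view?id=1&page=2';
my $entry = " <url>\n <loc>" . escape_xml($url) . "</loc>\n </url>";
print "$entry\n";
```

To use it in the generator, you would call `escape_xml($url)` in place of the bare `$url` inside the loop.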

One Response to Rudimentary Sitemap Generator
Tom February 17, 2011
It’s pretty easy to define crawl filters using most sitemap generator tools, and A1 Sitemap Generator, for example, can dig through HTML forms and most JavaScript.