I recently needed to grab the page titles and meta descriptions for a bunch of specific URLs, so I knocked up a quick Perl script to do the hard work for me. Paste the URLs you need, one per line, into a file called ‘urls.txt’, put it in the same folder as the script below, and run the script from the command line. The results are written to ‘meta-data.tsv’ as tab-separated values:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;

    open(my $in,  '<', 'urls.txt')      or die "Error: $!";
    open(my $out, '>', 'meta-data.tsv') or die "Error: $!";

    print $out "URL\tTitle\tDescription\n";

    my $i = 1;

    while (my $url = <$in>) {
        chomp($url);
        next unless $url =~ /\S/;    # skip blank lines

        # Initiate Mechanize bot - add your email address to the
        # user agent if you're getting a large number of URLs
        my $mech = WWW::Mechanize->new( agent => 'Meta Data Bot' );

        print "$i. Getting $url....\n";

        $mech->get($url);
        my $content = $mech->content;
        utf8::decode($content);

        # Defaults, so the output never contains an undefined value
        my ($title, $desc) = ('blank', 'blank');

        # No /g on the matches below: in scalar context /g makes each match
        # resume where the previous one stopped, so a description placed
        # before the title would be missed

        # Grab page title
        if ($content =~ m{<title>(.*?)</title>}si) {
            $title = $1;
        }

        # Grab meta description - not a great regex but works most of the time
        if ($content =~ m{<meta name="description" content="([^"]+)"}is) {
            $desc = $1;
        }
        if ($content =~ m{<meta content="([^"]+)" name="description"}is) {
            $desc = $1;
        }

        # Collapse tabs and newlines so each URL stays on one TSV row
        s/\s+/ /g for $title, $desc;

        $i++;
        print $out "$url\t$title\t$desc\n";
        sleep(1);
    }

    close($in);
    close($out);
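The description regex only copes with double quotes and two fixed attribute orders. If that turns out to be too strict, something along these lines might be a bit more forgiving. This is only a rough sketch, not a proper HTML parser: `meta_description` is a hypothetical helper name (it is not part of the script above), it assumes Perl 5.10 or later for the `//` operator, and it will still trip over a `>` inside an attribute value.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the content attribute of the first
# <meta name="description"> tag, or undef if there is none.
sub meta_description {
    my ($html) = @_;

    # Walk over every <meta ...> tag in the page
    while ($html =~ m{<meta\b([^>]*)>}gi) {
        my $attrs = $1;
        my %attr;

        # Collect attribute pairs, accepting single or double quotes
        while ($attrs =~ m{([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')}g) {
            $attr{lc $1} = defined $2 ? $2 : $3;
        }

        # Attribute order no longer matters; the name is compared
        # case-insensitively
        if (lc($attr{name} // '') eq 'description') {
            return $attr{content};
        }
    }
    return;
}

my $sample = q{<meta content='Hello there' name="Description" />};
my $found  = meta_description($sample);
print defined $found ? "$found\n" : "blank\n";
```

In the script you would call it on `$content` and fall back to 'blank' when it returns undef.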

2 Responses to Simple meta data bot in Perl
Web Hosting Nuggets August 28, 2010
Hi Rob,
Simple but effective article! I was just wondering whether there is any reason you are using the “g” global matching modifier when extracting the description meta data? I am asking because I get better matching results without that modifier…
Greetings
rob August 31, 2010
Hmm, good question – not really, just out of habit I guess. It makes more sense not to use it, as you suggest.
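For anyone wondering why dropping the modifier gives better results: in scalar context, a match with /g starts searching where the previous successful /g match on the same string left off. So once the title has matched, a description search with /g only looks at the part of the page after `</title>`. Here is a minimal sketch of that, using a made-up HTML snippet and two hypothetical helper subs (`with_g` and `without_g` exist only for this illustration):

```perl
use strict;
use warnings;

# Made-up page (an assumption for illustration): the description
# appears before the title
my $html = '<head><meta name="description" content="A test page">'
         . '<title>Test</title></head>';

# Hypothetical helper: both matches carry the /g modifier
sub with_g {
    my ($content) = @_;
    my ($title, $desc) = ('blank', 'blank');

    # A successful /g match in scalar context leaves pos($content)
    # just after </title>
    if ($content =~ m{<title>(.*?)</title>}gsi) { $title = $1; }

    # This search therefore starts after the title and never sees
    # the <meta> tag that came before it
    if ($content =~ m{<meta name="description" content="([^"]+)"}gis) { $desc = $1; }

    return ($title, $desc);
}

# Hypothetical helper: same regexes without /g, so every match
# starts from the beginning of the string
sub without_g {
    my ($content) = @_;
    my ($title, $desc) = ('blank', 'blank');
    if ($content =~ m{<title>(.*?)</title>}si) { $title = $1; }
    if ($content =~ m{<meta name="description" content="([^"]+)"}is) { $desc = $1; }
    return ($title, $desc);
}

my @g = with_g($html);
my @n = without_g($html);
print "with /g:    title=$g[0] desc=$g[1]\n";   # desc=blank
print "without /g: title=$n[0] desc=$n[1]\n";   # desc=A test page
```

Pages that put the description after the title are unaffected, which is probably why the /g version still worked most of the time.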