Simple meta data bot in Perl

I needed to grab the page titles and meta descriptions for a bunch of specific URLs recently and knocked up a quick Perl script to do the hard work for me. Paste the URLs you need, one per line, into a file called ‘urls.txt’, then run the script below from the command line in the same folder. The results are written to a tab-separated file called ‘meta-data.tsv’:

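#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

open(my $in, '<', 'urls.txt') or die "Error: $!";

# Write the output as UTF-8 so non-ASCII titles don't trigger warnings
open(my $out, '>:encoding(UTF-8)', 'meta-data.tsv') or die "Error: $!";
print $out "URL\tTitle\tDescription\n";
my $i = 1;
while (my $url = <$in>) {
    chomp($url);
    next unless $url =~ /\S/;    # skip blank lines

    # Initiate Mechanize bot - add your email address to the
    # user agent if you're getting a large number of URLs.
    # autocheck is off so one failed request doesn't kill the whole run.
    my $mech = WWW::Mechanize->new( agent => 'Meta Data Bot', autocheck => 0 );
    print "$i. Getting $url....\n";
    $i++;
    $mech->get($url);
    unless ($mech->success) {
        warn "Failed to get $url: " . $mech->status . "\n";
        print $out "$url\tblank\tblank\n";
        next;
    }

    # content() already returns decoded text, so no utf8::decode is needed
    my $content = $mech->content;
    my ($title, $desc) = ('blank', 'blank');

    # Grab page title.
    # Note: no /g here. In scalar context /g makes the next match on
    # $content resume where this one ended, so a description placed
    # before the title would be missed.
    if ($content =~ m{<title>(.*?)</title>}si) {
        $title = $1;
    }

    # Grab meta description - not a great regex but works most of the time
    if ($content =~ m{<meta name="description" content="([^"]+)"}is) {
        $desc = $1;
    }
    elsif ($content =~ m{<meta content="([^"]+)" name="description"}is) {
        $desc = $1;
    }

    # Collapse tabs and line breaks so each URL stays on one TSV row
    s/\s+/ /g for $title, $desc;

    print $out "$url\t$title\t$desc\n";

    sleep(1);
}

close($in);
close($out);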
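
The description regex in the script only matches double-quoted attributes separated by single spaces, in one of two orders. If that turns out to be too brittle, here is a minimal sketch of one alternative, assuming the HTML::HeadParser module (part of the HTML-Parser distribution) is installed. The `head_meta` helper and the sample page are made up for illustration and are not part of the original script:

```perl
use strict;
use warnings;
use HTML::HeadParser;

# Hypothetical helper: returns (title, description) for a chunk of HTML,
# falling back to 'blank' like the script above does.
sub head_meta {
    my ($html) = @_;
    my $parser = HTML::HeadParser->new;
    $parser->parse($html);

    # HTML::HeadParser exposes <title> as the "Title" header and
    # <meta name="description"> as "X-Meta-Description".
    my $title = $parser->header('Title');
    my $desc  = $parser->header('X-Meta-Description');

    return (
        defined $title ? $title : 'blank',
        defined $desc  ? $desc  : 'blank',
    );
}

# Made-up sample page with the attributes in the "other" order
my $sample = '<html><head><meta content="A test page" name="description">'
           . '<title>Hello</title></head><body></body></html>';
my ($title, $desc) = head_meta($sample);
print "$title\t$desc\n";
```

In the script you would pass `$mech->content` to the helper instead of the sample string. Because this is a real HTML parser, attribute order, quoting style and extra whitespace stop mattering.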
2 Responses to Simple meta data bot in Perl
  1. Web Hosting Nuggets

    Hi Rob,

    Simple but effective article! I was just wondering whether there is any reason why you are using the “g” global matching modifier when extracting the description meta data? I am asking because I get better matching results without that modifier…

    Greetings

  2. rob

    Hmm good question – not really, just out of habit I guess: makes more sense not to use it as you suggest :)
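
For anyone curious why dropping the “g” modifier gives better results, here is a minimal sketch that reproduces the effect. In scalar context a /g match remembers where it ended, and the next /g match on the same string starts from that position. The sample HTML and the `grab_meta` helper are made up for illustration:

```perl
use strict;
use warnings;

# Made-up sample page: the description comes BEFORE the title.
my $html = '<meta name="description" content="A test page"><title>Hello</title>';

# Hypothetical helper: runs the title match, then the description match,
# with or without /g, and returns [title, description].
sub grab_meta {
    my ($content, $use_g) = @_;
    my ($title, $desc) = ('blank', 'blank');
    if ($use_g) {
        # With /g, this match leaves pos($content) just after </title>...
        $title = $1 if $content =~ m{<title>(.*?)</title>}gsi;
        # ...so this search starts there and never sees the earlier meta tag.
        $desc = $1 if $content =~ m{<meta name="description" content="([^"]+)"}gis;
    }
    else {
        # Without /g, every match starts from the beginning of the string.
        $title = $1 if $content =~ m{<title>(.*?)</title>}si;
        $desc  = $1 if $content =~ m{<meta name="description" content="([^"]+)"}is;
    }
    return [$title, $desc];
}

my $with_g    = grab_meta($html, 1);
my $without_g = grab_meta($html, 0);
print "with /g:    $with_g->[0] | $with_g->[1]\n";       # with /g:    Hello | blank
print "without /g: $without_g->[0] | $without_g->[1]\n"; # without /g: Hello | A test page
```

So on any page where the description tag sits above the title, the /g version quietly reports ‘blank’.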
