Top 11 Perl modules for SEO

One of the truly great things about Perl is CPAN (the Comprehensive Perl Archive Network), which is an immense resource for almost all of the common (and not so common) programming functions you could ever dream of – from the web to graphics and operating system interfaces. Although Python and Ruby are gaining in popularity these days, CPAN is a huge asset to Perl that (as far as I’m aware) has few equals in other languages.

I’ve collected below some of the most useful modules I’ve found from an SEO’s point of view:

1. WWW::Mechanize

WWW::Mechanize is described as “handy web browsing in a Perl object”. It’s an immensely powerful scraping, crawling and HTML parsing tool, and supports cookies, browsing history, proxies, custom headers and more. It’s a subclass of LWP::UserAgent, so many of the functions in that module will also work here. There’s a great FAQ available on CPAN, as well as examples of what you might use it for.

It’s very simple to knock up fairly advanced tools in a few lines of code – for example, the snippet below prints all of the on-page links from a list of URLs:

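#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my @urls = qw(http://www.bbc.co.uk/ http://searchtalk.co.uk/);

# One browser object is enough; autocheck => 0 stops a failed fetch killing the script
my $mech = WWW::Mechanize->new( autocheck => 0 );

foreach my $url (@urls) {
  $mech->get($url);
  next unless $mech->success;   # skip URLs that could not be fetched
  $mech->dump_links();          # print every link found on the page
}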

The number of things you can do with this module is pretty much limitless – aside from rendering JavaScript, Flash etc, anything you do in your browser can be automated with it. For example you can create your own APIs into services such as Google Webmaster Tools or Google Insights where the current API options are limited, and there are many other awesome applications that others have built off the back of this module. For further reading I’d recommend the book Spidering Hacks – most of the examples are out of date now but the concepts are pretty easy to adapt to other websites.

2. HTML::TokeParser

HTML::TokeParser essentially treats an HTML page as a series of “tokens”, rather than plain text that you run regular expressions over. This makes it a lot more robust in handling invalid or inconsistently formatted HTML, and is closer in concept to how search engines treat HTML pages. A mistake many people make is to push valid HTML as an SEO recommendation, while the reality is search engines don’t care, because they don’t treat HTML as well-formed XML, and so don’t break when a quotation mark is out of place.

There are a bunch of newer modules out that use XPath selectors to parse HTML, which in my experience are a bit easier to use, though perhaps not quite as powerful.
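To make the token-based approach of HTML::TokeParser a bit more concrete, here is a minimal sketch that pulls links and anchor text out of some deliberately sloppy markup. The sample HTML and the extract_links helper are made up for illustration; only the HTML::TokeParser calls (new, get_tag, get_trimmed_text) come from the module itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Deliberately sloppy markup: unquoted attribute, mixed case, unclosed <p>
my $html = <<'HTML';
<p>Some text <A HREF=/about.html>About us</a>
<p><a href="http://example.com/" rel="nofollow">Example</a>
HTML

# Hypothetical helper: returns a list of [href, anchor text] pairs
sub extract_links {
  my ($doc) = @_;
  my $parser = HTML::TokeParser->new(\$doc) or die "Cannot parse document";
  my @links;
  # get_tag skips ahead to the next <a> start tag; $tag->[1] holds its attributes
  while (my $tag = $parser->get_tag('a')) {
    my $href = $tag->[1]{href};
    next unless defined $href;
    my $text = $parser->get_trimmed_text('/a');   # text up to the closing </a>
    push @links, [$href, $text];
  }
  return @links;
}

printf "%s => %s\n", @$_ for extract_links($html);
```

No regular expressions are involved, so the mixed-case tag and the unquoted attribute don’t need any special handling.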

3. URI

URI is an essential module for manipulating URLs, converting relative URLs to absolute, etc.
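As a rough illustration, the sketch below resolves a relative link against the page it was found on and then picks a URL apart. The example.com addresses are placeholders; new_abs, host, path and canonical are standard URI methods.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

my $base = 'http://www.example.com/blog/post.html';

# Resolve a relative href against the page it was found on
my $abs = URI->new_abs('../images/logo.png', $base);
print "$abs\n";   # http://www.example.com/images/logo.png

# Pick a URL apart and normalise it
my $uri = URI->new('HTTP://WWW.Example.COM:80/a/b?x=1#top');
print $uri->host, "\n";        # host part of the URL
print $uri->path, "\n";        # path part of the URL
print $uri->canonical, "\n";   # normalised form of the whole URL
```

Canonicalising URLs before storing them is a handy way to avoid counting the same page twice in a crawl.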

4. Scrappy

Finding the XPath/CSS3 selector is easy with Firebug

Scrappy is a truly awesome module that integrates the WWW::Mechanize and Web::Scraper modules to make scraping and crawling even easier. One of the best features is that you can use XPath or CSS3 selectors to extract info from a webpage rather than labouring over increasingly complex regular expressions. It makes crawling a cinch as well, and supports multi-threaded crawling for speeding up your scripts. Writing a very basic multi-threaded crawler is as simple as:

crawlers 10, $starting_url, {
  'a' => sub {
    # find all links and add them to the queue to be crawled
    queue shift->href;
  }
};

5. Net::Whois::Raw

Quick & easy whois data gathering – for example:

#!/usr/bin/perl -w
use strict;
use Net::Whois::Raw;

print "Enter domain: ";
my $dom = <STDIN>;
chomp($dom);
my $dominfo = whois($dom);

print $dominfo;

6. WWW::Google::PageRank

WWW::Google::PageRank is a great little PageRank pinger that does exactly what it says on the tin – programmatically fetches the PageRank of any URL passed to it :)

7. Geo::IP

Geo::IP is another simple tool that looks up an IP’s country location – useful for all sorts of SEO tools.

8. Spreadsheet::XLS

Tired of exporting plain old CSV files from your tools? Want to export your shiny new SEO reports in an Excel format? Easy, just use Spreadsheet::XLS – it’s surprisingly simple to generate spreadsheets with multiple tabs, rich formatting and more. There’s also a module in development for the newer XLSX file format.

9. Parallel::ForkManager

Parallel::ForkManager is a simple parallel processing module. It forks a pool of child processes (rather than using threads), which means you can add parallelism to your scripts and scrapers with only a few extra lines of code.
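A minimal sketch of the usual Parallel::ForkManager pattern is shown below. The URLs are placeholders and the “work” done in each child is a stand-in for a real fetch; passing data back through finish and run_on_finish assumes a reasonably recent version of the module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;

# Placeholder URLs for illustration only
my @urls = map { "http://www.example.com/page$_.html" } 1 .. 6;

my $pm = Parallel::ForkManager->new(3);   # at most 3 children at once

my %status;
# Runs in the parent each time a child exits; collects what the child sent back
$pm->run_on_finish(sub {
  my ($pid, $exit_code, $ident, $signal, $core_dump, $data) = @_;
  $status{$ident} = $data ? $$data : 'no data';
});

foreach my $url (@urls) {
  $pm->start($url) and next;              # parent: move on to the next URL
  # child: stand-in for a real fetch of $url
  my $result = "checked " . length($url) . " chars";
  $pm->finish(0, \$result);               # child exits, passing a reference back
}
$pm->wait_all_children;

print "$_: $status{$_}\n" for sort keys %status;
```

The number passed to new caps how many children run at once, which is worth keeping low when all the requests hit the same site.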

10. LWP

LWP (Library for WWW in Perl) is perhaps the most well established interface to the web in Perl, with the most used module within it being LWP::UserAgent. However it is perhaps not quite as “plug & play” to use as some of the alternatives like WWW::Mechanize or Scrappy. There is a whole book dedicated to this set of modules – if you’re interested in learning more about scraping and crawling in Perl I’d definitely recommend it.

11. DBI

If you’re going to build SEO tools, you’ll need to interact with a database once you reach a certain level of complexity. DBI is an essential module for interacting with different databases. Most SEOs are probably more familiar with MySQL than other DB types; DBI handles it easily, and its placeholders (bound parameters) help keep queries safe from SQL injection.
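Here is a small sketch of the placeholder style with DBI. It uses an in-memory SQLite database (via DBD::SQLite, an assumption on my part) purely so that it runs without a server; the rankings table and its columns are invented for the example. For MySQL only the connect line would change.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# In-memory SQLite so the sketch runs without a server. For MySQL the DSN would
# look something like "dbi:mysql:database=seo;host=localhost" (names hypothetical).
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

# Hypothetical table for storing ranking data
$dbh->do('CREATE TABLE rankings (keyword TEXT, url TEXT, position INTEGER)');

# Placeholders (?) keep values out of the SQL string, so quotes are harmless
my $insert = $dbh->prepare(
  'INSERT INTO rankings (keyword, url, position) VALUES (?, ?, ?)');
$insert->execute(q{perl seo}, 'http://example.com/', 3);
$insert->execute(q{o'reilly "books"}, 'http://example.com/books', 7);

# Fetch every keyword ranking in the top 5
my $rows = $dbh->selectall_arrayref(
  'SELECT keyword, position FROM rankings WHERE position <= ? ORDER BY position',
  undef, 5);
print "$_->[0] is at position $_->[1]\n" for @$rows;

$dbh->disconnect;
```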

APIs

I haven’t mentioned any API modules in this list, although there are around 5,000 API modules listed on CPAN, including ones for Facebook, Twitter, Flickr and other less well known services such as Wordnik.

Have you got any favourites I’ve missed? Share them below!

Remembering Jaamit

Very sad that I couldn’t make it to Ealing today to say goodbye to my friend & colleague Jaamit Durrani, due to this damn freak weather. So instead I wanted to write a little bit about Jaamit as I knew him. I have held off writing anything much until now, but today should be a day for reflection and celebration of his life, and this is the best way I can think of doing that.

The first time I met Jaamit was in March this year; he’d sent in his CV for a position on our SEO team, and as soon as I saw it, I wanted him to join us – it was a great CV, and I was also aware of his presence on Twitter, although I wasn’t really active on it myself at the time.

He wanted to go for a drink after his first interview to get a feel for the existing team. We were quickly talking about our visions for where we wanted to go with SEO, and his passion for the subject and depth of knowledge shone through immediately. He was one of those very rare people you can be a stranger to one minute, and the next feel like they’re an old friend you’ve known for a long time.

Working with Jaamit for 8 months (albeit separated between buildings) was an absolute privilege, and I have a lot of fond memories of his passion, professionalism, warmth and sense of humour.

He was very proud of getting a speaking slot at the Think Visibility conference in September, and persuaded our whole SEO crew to follow him up there. As Rishi said, he was a perfectionist, and was working and re-working his presentation on the train up, late into the night, and even during the presentation before his. As always any nerves were unjustified as the reception to his talk was fantastic – for many a highlight of the conference. He may have been a perfectionist but he was also an engaging, passionate public speaker who could easily win over any crowd.

I remember working remotely late one night before a pitch with him, and after asking him several times to send over his slides so far I eventually received them with the disclaimer:

“this deck is a load of crap – just to warn you”

After looking them through I had to write back:

“dude don’t know what you’re talking about – it’s looking pretty darn good IMO”

The slides were all but finished, and yet he continued working on them until he thought they were perfect (which they were of course). His modest response “wow – really?” speaks volumes.

He was genuinely a great SEO, and his generosity towards others in the industry was always forthcoming – never afraid to give, ask for or take advice from others. He took it upon himself to introduce people at conferences and was a relentless networker – he genuinely loved meeting new people and sharing ideas about SEO or pretty much any subject. He enjoyed getting people talking, and he alone briefly introduced me to Shaun Anderson, Rand, Will Critchlow and many others at the 3 conferences we went to together.

The reaction online to his tragic passing has been something to behold – not just because of the tributes people have given, but also because it really is genuinely astounding that one person could mean so much to so many people, both inside of and outside of the SEO industry. Given his passion, achievements, knowledge and understanding of SEO, it’s easy to forget that he was relatively new to the industry, and had a similar effect on people outside of the industry.

RIP Jaamit

Scrape Google Scribe

Google Labs released a new tool called “Scribe” today, which auto-completes sentences based on those Google has found in web pages. It’s fun to play with as a gimmick, and potentially useful in Gmail and other apps for users; however, there are definitely some very interesting potential uses for us SEOs :)


Sending script requests through multiple IPs

One of the more useful things you may want to do with Perl scripts if you’re into crawling websites is to pipe your script’s requests through multiple IP addresses.

This is actually pretty simple when you know how, but doesn’t seem to be documented that well across the web. So the following steps should work pretty well if you’re running an Apache server:

1. Configure Apache

Make sure you’re running the mod_proxy module. Then add the following code to your Apache conf file:

ProxyRequests On
<Proxy *>
  Order deny,allow
  Deny from all
  Allow from internal.example.com
</Proxy>

2. Install Squid

Squid is a dual-purpose caching & proxying program; it can be installed on RHEL 5 by following these instructions.
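On the script side, one way to spread requests across several proxies is a simple round-robin, sketched below with LWP::UserAgent. The proxy addresses and the next_agent helper are hypothetical, and the sketch assumes each proxy endpoint is bound to a different outgoing IP address; the actual request is left commented out.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical proxy endpoints, assumed to map to one outgoing IP address each
my @proxies = qw(
  http://10.0.0.1:3128/
  http://10.0.0.2:3128/
  http://10.0.0.3:3128/
);

my $next = 0;

# Hypothetical helper: returns a user agent set up for the next proxy in the list
sub next_agent {
  my $proxy = $proxies[ $next++ % @proxies ];   # round-robin through the list
  my $ua = LWP::UserAgent->new(timeout => 10);
  $ua->proxy(['http'], $proxy);                 # send http:// requests via this proxy
  return ($ua, $proxy);
}

foreach my $url (qw(http://www.example.com/a http://www.example.com/b)) {
  my ($ua, $proxy) = next_agent();
  print "Would fetch $url via $proxy\n";
  # my $response = $ua->get($url);   # uncomment to send the request
}
```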


Report a broken link

Just found a “Page not found” error on the RNIB website – why don’t more 404 pages have this “report a broken link” feature? It shows users you care and gives developers useful information:

Oh and it probably wouldn’t hurt your link profile either to find out & fix these broken links quickly and as a matter of process.
