perl

Perl’s HTML::TreeBuilder::XPath is a great module for parsing HTML documents without regular expressions, however it returns text content by default, which is not always what you want when you’re doing advanced HTML processing. The documentation on CPAN doesn’t mention this, but if you want to get out the HTML content, just use “findnodes” and “->shift->as_HTML” in the way illustrated below:

my $value = $tree->findnodes(q{//div[@class='crumbs'})->shift->as_HTML

Dynamically fetch web page contents in Excel

Excel’s built-in web features are pretty frustrating when you want to do more with the web than import a static HTML table to a predefined set of cells.

I’ve often wanted to be able to update the contents of a cell based on dynamic parameters passed into a URL, and not found a decent, easy way of doing this. The official Office website shows you how to do this the Microsoft way, but lo and behold that doesn’t actually translate to real-world uses very well.

Say for example you want to fill a column of cells with the ranking for a given list of keywords, a function similar to that shown below (where the URL could potentially be defined in the E column) would be very useful:

Provided you had an API into ranking data (see Perl code below), this should be an easy operation, but this doesn’t seem to be something Excel does out of the box.

I’m sure there are tons of xll’s and paid for solutions out there to do the same thing, but I want something that’s flexible, and ideally free. Hence the user-defined function below can replace Excel’s awkward built-in parameter handling quite easily and for lots of different uses:

<br />
 &#8216; Name the function anything you like &#8211; the variable parameters to pass in this instance are keyword, domain, number of results<br />
 Public Function getURL(kw As String, domain As String, num As Integer)</p>
<p>&#8216; Build the URL from the parameters passed in<br />
 URL = &quot;http://www.example.com/api/script.pl?kw=&quot; &amp; kw &amp; &quot;&amp;domain=&quot; &amp; domain &amp; &quot;&amp;num=&quot; &amp; num</p>
<p>&#8216; Uncomment to debug the request URL<br />
 &#8216;Debug.Print &quot;Request URL: &quot; &amp; URL</p>
<p>Dim objhttp As Object</p>
<p>Set objhttp = CreateObject(&quot;MSXML2.ServerXMLHTTP.6.0&quot;)</p>
<p>objhttp.Open &quot;GET&quot;, URL, False<br />
 objhttp.setRequestHeader &quot;Content-type&quot;, &quot;application/x-www-form-urlencoded&quot;<br />
 &#8216; If your URL uses basic HTTP authentication (like mine did), uncomment the line below and replace the contents of [] with your base 64 encoded credentials<br />
 &#8216;objhttp.setRequestHeader &quot;Authorization&quot;, &quot;Basic &quot; &amp; &quot;[base64_encode[user:pass]]&quot;<br />
 objhttp.send (&quot;&quot;)</p>
<p>Dim Response As String<br />
 Response = objhttp.responseText</p>
<p>&#8216; To debug the output into Excel&#8217;s error console, uncomment the line below<br />
 &#8216;Debug.Print &quot;GA data feed response: &quot; Response</p>
<p>&#8216; Load contents of URL into the cell that&#8217;s called the function<br />
 getURL = Response</p>
<p>End Function<br />
 

I’m no VBA whiz or anything so I’m sure the above code could be tweaked vastly to make it faster, more robust, and more flexible for variable numbers of query, different query types etc (open to suggestions :) ), but hey, it’s a quick & easy way to solve a problem.

The Perl code: as mentioned above, the VBA code was written specifically to grab ranking data from an API. Hence there’s no fancy XPath expressions in the VBA, or even any attempt to parse the output. This is another potential improvement, but I try and keep Excel’s interactions with the web & text processing to an absolute minimum because Excel + teh internets = sloooow. In this instance, the script output is literally just the numeric data, so any processing is done by the faster Perl code.

The code below relies on a custom scraping library (SearchMarketing::Crawl::GoogleNatural) – I’ll leave that code up to you to re-create ;)

It’ll only check the top 100 results but for most purposes that’s more than enough. If there is more than one result for the given domain, it’ll print the numbers separated by an ampersand and can then be manipulated in Excel if necessary (I’ll just about trust Excel for that…)

<br />
#!/usr/bin/perl -wT<br />
use strict;<br />
use WWW::Mechanize;<br />
use SearchMarketing::Crawl::GoogleNatural;<br />
use CGI qw(:standard);</p>
<p>my $kw          = param(&quot;kw&quot;);<br />
my $num_results = param(&quot;num&quot;);<br />
my $domain      = param(&quot;domain&quot;);</p>
<p>print header;</p>
<p>getPos($kw,$num_results,$domain);</p>
<p>sub getPos {</p>
<p>  my $kw          = $_[0];<br />
  my $num_results = $_[1];<br />
  my $domain      = $_[2];</p>
<p>  my %results = googleUK($kw, $num_results);<br />
  my $count = 0;<br />
  foreach my $position ( keys %results ) {<br />
    my $title    = $results{$position}[0];<br />
    my $snippet  = $results{$position}[1];<br />
    my $dest_url = $results{$position}[2];<br />
    my $disp_url = $results{$position}[3];</p>
<p>    if ($dest_url =~ m{^https?://[^/]*$domain/?.*$}i) {</p>
<p>      if ($count &gt; 0) {<br />
        print &quot;&amp;&quot;;<br />
      }<br />
      print $position;<br />
      $count++;<br />
    }<br />
  }<br />
  if ($count == 0) {<br />
    print &quot;N/A&quot;;<br />
  }<br />
}<br />

Why code?

I wish I had a pound for every time another SEO told me they want to learn a programming language. It seems most SEOs are sure they want to learn PHP, Python or another programming language, but when asked the question “to what end?” the answers generally become less clear.

Because of this I think the following is the reason why a lot of SEOs never end up taking that step:

Let’s face it: teaching yourself a new language is never easy, and it becomes much harder if you don’t actually know why you’re doing it. For this reason most people get frustrated and give up before they hit the red line above and get any significant payoff for investing the time in learning a new language.

When I taught myself Perl it wasn’t directly to do with SEO – in my first job I spent a whole day each week manually editing an HTML newsletter template in Notepad++. I hated it so much I figured there was probably a better way do it so I bought an O’Reilly book, got up 2 hours early every day until I knew the basics and could build a tool to generate the HTML for me. That saved me 5 hours a week of boring tasks and got me nicely into the payoff zone.

What’s your imperative?

Top 11 Perl modules for SEO

One of the truly great things about Perl is CPAN (the Comprehensive Perl Archive Network), which is an immense resource for almost all of the common (and not so common) programming functions you could ever dream of – from the web to graphics and operating system interfaces. Although Python and Ruby are gaining in popularity these days, CPAN is a huge asset to Perl that (as far as I’m aware) has few equals in other languages.

I’ve collected below some of the most useful modules I’ve found from an SEO’s point of view:

1. WWW::Mechanize

WWW::Mechanize is described as “handy web browsing in a Perl object”. It’s an immensely powerful scraping, crawling and HTML parsing tool, and supports cookies, browsing history, proxies, custom headers and more. It’s a subclass of LWP::UserAgent, so many of the functions in that module will also work here. There’s a great FAQ available on CPAN, as well as examples of what you might use it for.

It’s very simple to knock up fairly advanced tools in several lines of code – for example the snippet below will print all of the on-page links from a list of URLs:

#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;

my @urls = qw(http://www.bbc.co.uk/ http://searchtalk.co.uk/);

foreach my $url (@urls) {
  my $mech = WWW::Mechanize->new();
  $mech->get($url);
  $mech->dump_links();
}

The amount of things you can do with this module are pretty much limitless – aside from rendering JavaScript, Flash etc, anything you do in your browser can be automated through the use of this module. For example you can create your own APIs into services such as Google Webmaster Tools or Google Insights where the current API options are limited, and there are many other awesome applications that others have built off the back of this module. For further reading I’d recommend the book Spidering Hacks – most of the examples are out of date now but the concepts are pretty easy to adapt to other websites.

2. HTML::TokeParser

HTML::TokeParser essentially treats an HTML page as a series of “tokens”, rather than plain text that you run regular expressions over. This makes it a lot more robust in handling invalid or inconsistently formatted HTML, and is closer in concept to how search engines treat HTML pages. A mistake many people make is to recommend valid HTML as an SEO recommendation, while the reality is search engines don’t care, because they don’t treat HTML as well-formed XML, and so don’t break when a quotation mark is out of place.

There are a bunch of newer modules out that use XPath selectors to parse HTML, which in my experience are a bit easier to use, though perhaps not quite as powerful.

3. URI

URI is an essential module in manipulating URLs, converting relative URLs to absolute, etc.

4. Scrappy

Finding the XPath/CSS3 selector is easy with Firebug

Scrappy is a truly awesome module that integrates the WWW::Mechanize and Web::Scraper modules to make scraping and crawling even easier. One of the best features is that you can use XPath or CSS3 selectors to extract info from a webpage rather than labouring over increasingly complex regular expressions. It makes crawling a sych as well, and supports multi-threaded crawling for speeding up your scripts. Writing a very basic multi-threaded crawler is as simple as:

crawlers 10, $starting_url, {
'a' => sub {
# find all links and add them to the queue to be crawled
queue shift->href;
}
};

5. Net::Whois::Raw

Quick & easy whois data gathering – for example:

#!/usr/bin/perl -w
use strict;
use Net::Whois::Raw;

print "Enter domain: ";
my $dom = <STDIN>;
chomp($dom);
my $dominfo = whois($dom);

print $dominfo;

6. WWW::Google::PageRank

WWW::Google::PageRank is a great little PageRank pinger that does exactly what it says on the tin – programmatically fetches the PageRank of any URL passed to it :)

7. Geo::IP

Geo::IP is another simple tool that looks up an IP’s country location – useful for all sorts of SEO tools.

8. Spreadsheet::XLS

Tired of exporting plain old CSV files from your tools? Want to export your shiny new SEO reports in an Excel format? Easy, just use Spreadsheet::XLS – it’s surprisingly simple to generate spreadsheets with multiple tabs, rich formatting and more. There’s also a module in development for the newer XLSX file format.

9. Parallel::ForkManager

Parallel::ForkManager is a simple parallel processing module, which means you can add multi-threading to your code and speed up your scripts and scrapers in seconds.

10. LWP

LWP (Library for WWW in Perl) is perhaps the most well established interface to the web in Perl, with the most used module within it being LWP::UserAgent. However it is perhaps not quite as “plug & play” to use as some of the alternatives like WWW::Mechanize or Scrappy. There is a whole book dedicated to this set of modules – if you’re interested in learning more about scraping and crawling in Perl I’d definitely recommend it.

11. DBI

If you’re going to build SEO tools, you’ll need to interact with a database once you reach a certain level of complexity. DBI is an essential module for interacting with different databases. Most SEOs are probably more familiar with MySQL than other DB types, which DBI handles easily & securely.

APIs

I haven’t mentioned any API modules in this list, although there are around 5,000 listed API modules listed on CPAN, including for Facebook, Twitter, Flickr and other less well known services such as Wordnik.

Have you got any favourites I’ve missed? Share them below!

Scrape Google Scribe

Google Labs released a new tool called “Scribe” today, which auto-completes sentences based on those Google has found in web pages. Fun to play with as a gimmick, and potentially useful in Gmail and other apps for users, however there’s definitely some very interesting potential uses for us SEOs :)

Read More…

1 2  Scroll to top