perl

Perl’s HTML::TreeBuilder::XPath is a great module for parsing HTML documents without regular expressions, however it returns text content by default, which is not always what you want when you’re doing advanced HTML processing. The documentation on CPAN doesn’t mention this, but if you want to get out the HTML content, just use “findnodes” and “->shift->as_HTML” in the way illustrated below:

my $value = $tree->findnodes(q{//div[@class='crumbs'})->shift->as_HTML

Dynamically fetch web page contents in Excel

Excel’s built-in web features are pretty frustrating when you want to do more with the web than import a static HTML table to a predefined set of cells.

I’ve often wanted to be able to update the contents of a cell based on dynamic parameters passed into a URL, and not found a decent, easy way of doing this. The official Office website shows you how to do this the Microsoft way, but lo and behold that doesn’t actually translate to real-world uses very well.

Say for example you want to fill a column of cells with the ranking for a given list of keywords, a function similar to that shown below (where the URL could potentially be defined in the E column) would be very useful:

Provided you had an API into ranking data (see Perl code below), this should be an easy operation, but this doesn’t seem to be something Excel does out of the box.

I’m sure there are tons of xll’s and paid for solutions out there to do the same thing, but I want something that’s flexible, and ideally free. Hence the user-defined function below can replace Excel’s awkward built-in parameter handling quite easily and for lots of different uses:

<br />
 &#8216; Name the function anything you like &#8211; the variable parameters to pass in this instance are keyword, domain, number of results<br />
 Public Function getURL(kw As String, domain As String, num As Integer)</p>
<p>&#8216; Build the URL from the parameters passed in<br />
 URL = &quot;http://www.example.com/api/script.pl?kw=&quot; &amp; kw &amp; &quot;&amp;domain=&quot; &amp; domain &amp; &quot;&amp;num=&quot; &amp; num</p>
<p>&#8216; Uncomment to debug the request URL<br />
 &#8216;Debug.Print &quot;Request URL: &quot; &amp; URL</p>
<p>Dim objhttp As Object</p>
<p>Set objhttp = CreateObject(&quot;MSXML2.ServerXMLHTTP.6.0&quot;)</p>
<p>objhttp.Open &quot;GET&quot;, URL, False<br />
 objhttp.setRequestHeader &quot;Content-type&quot;, &quot;application/x-www-form-urlencoded&quot;<br />
 &#8216; If your URL uses basic HTTP authentication (like mine did), uncomment the line below and replace the contents of [] with your base 64 encoded credentials<br />
 &#8216;objhttp.setRequestHeader &quot;Authorization&quot;, &quot;Basic &quot; &amp; &quot;[base64_encode[user:pass]]&quot;<br />
 objhttp.send (&quot;&quot;)</p>
<p>Dim Response As String<br />
 Response = objhttp.responseText</p>
<p>&#8216; To debug the output into Excel&#8217;s error console, uncomment the line below<br />
 &#8216;Debug.Print &quot;GA data feed response: &quot; Response</p>
<p>&#8216; Load contents of URL into the cell that&#8217;s called the function<br />
 getURL = Response</p>
<p>End Function<br />
 

I’m no VBA whiz or anything so I’m sure the above code could be tweaked vastly to make it faster, more robust, and more flexible for variable numbers of query, different query types etc (open to suggestions :) ), but hey, it’s a quick & easy way to solve a problem.

The Perl code: as mentioned above, the VBA code was written specifically to grab ranking data from an API. Hence there’s no fancy XPath expressions in the VBA, or even any attempt to parse the output. This is another potential improvement, but I try and keep Excel’s interactions with the web & text processing to an absolute minimum because Excel + teh internets = sloooow. In this instance, the script output is literally just the numeric data, so any processing is done by the faster Perl code.

The code below relies on a custom scraping library (SearchMarketing::Crawl::GoogleNatural) – I’ll leave that code up to you to re-create ;)

It’ll only check the top 100 results but for most purposes that’s more than enough. If there is more than one result for the given domain, it’ll print the numbers separated by an ampersand and can then be manipulated in Excel if necessary (I’ll just about trust Excel for that…)

<br />
#!/usr/bin/perl -wT<br />
use strict;<br />
use WWW::Mechanize;<br />
use SearchMarketing::Crawl::GoogleNatural;<br />
use CGI qw(:standard);</p>
<p>my $kw          = param(&quot;kw&quot;);<br />
my $num_results = param(&quot;num&quot;);<br />
my $domain      = param(&quot;domain&quot;);</p>
<p>print header;</p>
<p>getPos($kw,$num_results,$domain);</p>
<p>sub getPos {</p>
<p>  my $kw          = $_[0];<br />
  my $num_results = $_[1];<br />
  my $domain      = $_[2];</p>
<p>  my %results = googleUK($kw, $num_results);<br />
  my $count = 0;<br />
  foreach my $position ( keys %results ) {<br />
    my $title    = $results{$position}[0];<br />
    my $snippet  = $results{$position}[1];<br />
    my $dest_url = $results{$position}[2];<br />
    my $disp_url = $results{$position}[3];</p>
<p>    if ($dest_url =~ m{^https?://[^/]*$domain/?.*$}i) {</p>
<p>      if ($count &gt; 0) {<br />
        print &quot;&amp;&quot;;<br />
      }<br />
      print $position;<br />
      $count++;<br />
    }<br />
  }<br />
  if ($count == 0) {<br />
    print &quot;N/A&quot;;<br />
  }<br />
}<br />

Why code?

I wish I had a pound for every time another SEO told me they want to learn a programming language. It seems most SEOs are sure they want to learn PHP, Python or another programming language, but when asked the question “to what end?” the answers generally become less clear.

Because of this I think the following is the reason why a lot of SEOs never end up taking that step:

Let’s face it: teaching yourself a new language is never easy, and it becomes much harder if you don’t actually know why you’re doing it. For this reason most people get frustrated and give up before they hit the red line above and get any significant payoff for investing the time in learning a new language.

When I taught myself Perl it wasn’t directly to do with SEO – in my first job I spent a whole day each week manually editing an HTML newsletter template in Notepad++. I hated it so much I figured there was probably a better way do it so I bought an O’Reilly book, got up 2 hours early every day until I knew the basics and could build a tool to generate the HTML for me. That saved me 5 hours a week of boring tasks and got me nicely into the payoff zone.

What’s your imperative?

SEO & Perl: An introduction

perl-camelIt’s generally accepted that it’s useful/invaluable for any SEO to know at least one server-side language. I chose Perl after playing around with PHP and finding it didn’t quite fulfill my requirements. Why Perl and not Ruby/Python/C? Perl is well established (22 years) and supported, has an active community, a great library of extensions (modules), and perhaps most importantly for SEO, was built for processing and reporting on textual data (‘Perl’ is Practical Extraction and Reporting Language.)

Perl has a (misguided) reputation for being difficult to learn and producing unreadable and ‘ugly’ code. In reality, as with any other programming language, the language never produces ugly code, the programmer does. However, perhaps Perl does lend itself to ugly code, simply because it’s so damn easy to hack together useful scripts quickly :)

So where’s the link with SEO? Well a few of the reasons below should help to explain:

  • HTML = text. Perl is great at processing text, and by extension, HTML. As search engine optimisers, we deal with HTML on a daily basis – want to see any web page’s meta tags, heading tags, alt text, etc? Stop messing around with the Web Developer toolbar and build a Perl script to do it for you ;)
  • Spidering. With the help of LWP and WWW::Mechanize, powerful spiders can be written with only a few lines of code. The benefits of this should be obvious; suffice to say that if you’ve written a spider or few, maybe you’ll get a slightly better understanding of what Google’s spiders may (or may not) be capable of. Beyond that the number of competitive intelligence tools that can be built around this are limitless.
  • Regular Expression support. Apologies if this is jargon, but regex support is second to none. It’s built into the programming language and is incredibly powerful (PHP borrows ‘Perl-compatible regular expressions’.) This makes processing complex web pages a very easy thing to do.
  • Interaction with the web. There are probably thousands of modules in the CPAN library that make interacting with the web on many levels very, very easy (and advanced.) Incidentally, modern Perl’s UTF-8 support is leagues ahead of PHP’s.
  • Database interaction. To me it’s more logical than PHP (using the DBI module), and has been said to be far more secure.
  • Building websites. I still mainly use PHP for building the front-end to my SEO tools, but that’s just laziness really. Perl has a great embeddable code support for building websites (I just haven’t learned it yet), and the CGI module is great. A lot of the back-end functionality (glue) is written in Perl because it’s quick, reliable, secure, and plays nicely with PHP.

So, where to start if you’re into SEO and want to learn Perl? My first Perl/programming book was Learning Perl (Win32 version actually) – a very smooth introduction to the language, and even good for complete n00bs to programming (I only knew HTML and a bit of PHP when I started on Perl.)

O’Reilly really has the best library of Perl books I’ve found – Spidering Hacks is the most directly relevant for SEO (a lot of the examples are out of date, but if you know a bit of Perl before you start there’s a lot of good ideas here.) The Perl Cookbook is also a fantastic resource.

Perl Ubiquity commands

Inspired by IrishWonder’s excellent set of SEO commands for Ubiquity for the awesome Firefox plugin Ubiquity, I’ve knocked up a couple of Ubiquity commands for Perl users – a useful direct CPAN search function and a search of the Perldoc.org website.

Simply bring up Ubiquity with CTRL-Space and enter ‘help’. Then past the commands below into the text box in the “Command Editor” tab and close the window to use them (commands in the text box will be saved automatically).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
CmdUtils.CreateCommand({
name: "cpan",
icon: "http://search.cpan.org/favicon.ico",
description: "Search the CPAN Perl repository",
homepage: "http://www.rob-hammond.co.uk/",
author: { name: "Rob Hammond", email: "rob@robbiehammond.com"},
license: "GPL",
takes: {URL: noun_arb_text},
preview: function(pblock, directObject) {
searchText = jQuery.trim(directObject.text);
if(searchText.length < 1) {
pblock.innerHTML = "Searches CPAN for modules";
return;
}
var previewTemplate = "Perform a CPAN lookup for <b>${query}</b>";
var previewData = {query: searchText};
pblock.innerHTML = CmdUtils.renderTemplate(previewTemplate, previewData);
},
execute: function(directObject) {
var url = "http://search.cpan.org/search?query={QUERY}&amp;amp;amp;amp;mode=all";
var query = directObject.text;
var urlString = url.replace("{QUERY}", query);
Utils.openUrlInBrowser(urlString);
}
});

CmdUtils.CreateCommand({
name: "perldoc",
icon: "http://perldoc.perl.org/favicon.ico",
description: "Search Perldoc.org",
homepage: "http://www.rob-hammond.co.uk/",
author: { name: "Rob Hammond", email: "rob@robbiehammond.com"},
license: "GPL",
takes: {URL: noun_arb_text},
preview: function(pblock, directObject) {
searchText = jQuery.trim(directObject.text);
if(searchText.length < 1) {
pblock.innerHTML = "Searches Perldoc";
return;
}
var previewTemplate = "Perform a Perldoc lookup for <b>${query}</b>";
var previewData = {query: searchText};
pblock.innerHTML = CmdUtils.renderTemplate(previewTemplate, previewData);
},
execute: function(directObject) {
var url = "http://perldoc.perl.org/search.html?q={QUERY}";
var query = directObject.text;
var urlString = url.replace("{QUERY}", query);
Utils.openUrlInBrowser(urlString);
}
});

If I write any more useful Ubiquity commands or update these I’ll set up a proper subscribe page ;)

 Scroll to top