Sending script requests through multiple IPs

If you’re into crawling websites, one of the more useful things you can do with Perl scripts is to route your script’s requests through multiple IP addresses.

This is simple once you know how, but it isn’t well documented on the web. The following steps should work if you’re running an Apache server:

1. Configure Apache

Make sure you’re running the mod_proxy module. Then add the following code to your Apache conf file:

# Enable forward proxying (requires mod_proxy)
ProxyRequests On
<Proxy *>
    # Refuse everyone except the named internal host
    Order deny,allow
    Deny from all
    Allow from internal.example.com
</Proxy>
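
As a rough client-side illustration of the idea (in Python rather than the Perl the post has in mind), the sketch below rotates requests across several forward proxies in round-robin order. The proxy addresses, port 3128 and all function names are assumptions made for illustration; none of them come from the setup described here.

```python
import itertools
import urllib.request

# Hypothetical proxy endpoints, one per outgoing IP address (assumed values).
PROXIES = [
    "http://192.0.2.10:3128",
    "http://192.0.2.11:3128",
    "http://192.0.2.12:3128",
]


def proxy_cycle(proxies):
    """Return an iterator that yields proxy URLs in round-robin order, forever."""
    return itertools.cycle(proxies)


def build_opener_for(proxy_url):
    """Return a urllib opener that sends HTTP and HTTPS requests via proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_all(urls, proxies=PROXIES, timeout=10):
    """Fetch each URL through the next proxy in the rotation.

    Returns a dict mapping each URL to (proxy used, HTTP status code).
    """
    rotation = proxy_cycle(proxies)
    results = {}
    for url in urls:
        proxy_url = next(rotation)
        opener = build_opener_for(proxy_url)
        with opener.open(url, timeout=timeout) as response:
            results[url] = (proxy_url, response.status)
    return results
```

Calling `fetch_all(["http://example.com/a", "http://example.com/b"])` would send the first request through the first proxy and the second through the next one, provided proxies really are listening at those addresses.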

2. Install Squid

Squid is a dual-purpose caching and proxying program; it can be installed on RHEL 5 by following these instructions.

Continued…

Posted in apache, perl, squid.

Report a broken link

Just found a “Page not found” error on the RNIB website – why don’t more 404 pages have this “report a broken link” feature? It shows users you care and gives developers useful information.

Oh, and finding and fixing these broken links quickly, as a matter of process, probably wouldn’t hurt your link profile either.

Posted in accessibility, usability.

5 SEO Lessons from Mad Men

As the consistently excellent Mad Men kicks off its fourth series in the US, I thought it would be a good time to reflect on a few advertising rules from the gurus at Sterling Cooper that apply equally to SEO.

Rule #5: “If you don’t like what’s being said, change the conversation”

Reputation management can be a key part of an SEO’s job, especially as clients learn that Google is their new home page. So how do you deal with negative search results appearing for your brand name?

  • Change the conversation – Dell are a fantastic example of this; after the “Dell Hell” episode, Dell turned around their customer support, and with the IdeaStorm website actively showed their customers they were listening. There aren’t a lot of large companies that have the ability or will to turn that kind of thing around.

In SEO there are also a few other things we can do to address reputation issues:

  • Suppress the conversation – just because an isolated voice has a grudge against your company, that doesn’t mean it’s justified. What about the hundreds of satisfied customers who’ve given you testimonials? By its nature, negative press gets more traction than positive press – a bit of link building towards positive reviews can help give your satisfied customers a voice. Of course, if you don’t have any satisfied customers, that’s a different story entirely ;)
  • Influence the conversation – imagine you’re Microsoft and have just launched a flagship gaming console. People are searching for [xbox reviews] but the first thing Google Suggest throws up is “xbox repair” and “xbox red ring of death”. Search volumes are said to influence the suggestions box, and we know above-the-line (ATL) advertising can have a significant effect on search volumes. For a large company like Microsoft, it’s possible a co-ordinated ATL campaign could influence searcher behaviour, if not to remove the RROD suggestion, at least to push it down a bit. How about releasing a red Xbox and promoting that? Phrases such as [xbox red], [xbox red 120gb], etc. may start to populate the suggestions box. This example’s rather far-fetched but I think it’s a valid point.
Image caption: It’s notoriously difficult to affect Google Suggest, but a creative ATL campaign may insert new suggestions.

  • Buy the conversation – a dangerous road to go down, in that you’re rewarding your critics, but if there’s a result from a poisonous blog that just won’t budge in the SERPs, approaching its owner with an offer to buy the site may help remove it.

Continued…

Posted in seo.

Show referring keywords from Google Base

Google’s Product search is lumped into organic referrals in Google Analytics if you don’t use tracking parameters, which can distort the level of true natural search volume you see in any web analytics package.

A typical solution shown on some websites is to add ‘utm_’ tracking variables to the URL, such as http://example.com/?utm_source=google&utm_medium=products, which correctly identifies the traffic as being from Google Product search (whether it’s the actual shopping search or a Google onebox). However, this solution sacrifices the referring keyword data, and just shows the site as a referring website.

The simple solution to this is to change the utm_medium parameter to ‘organic’ (and the utm_source to ‘googlebase’ or similar), which Google Analytics then treats as a normal organic search engine, and accordingly reports the referring keyword :)
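
To make the two tagging schemes concrete, here is a small Python sketch that appends the tracking parameters to a product URL. The helper name `add_tracking` is made up for this example, and ‘googlebase’ is just the source label suggested above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_tracking(url, source, medium):
    """Return url with utm_source and utm_medium appended to its query string."""
    parts = urlsplit(url)
    # Keep any query parameters the URL already carries.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query += [("utm_source", source), ("utm_medium", medium)]
    return urlunsplit(parts._replace(query=urlencode(query)))


# First method: reported as a referring site.
print(add_tracking("http://example.com/", "google", "products"))
# Second method: reported as an organic search engine.
print(add_tracking("http://example.com/", "googlebase", "organic"))
```

The first call reproduces the example URL above; the second differs only in the two parameter values, which is all the change amounts to.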

Update: Tripleox correctly pointed out on Twitter that you can still get keyword data with the first method by going to Traffic Sources -> google / products and using the dimensions drop-down menu to select keywords.

Posted in analytics.

Rudimentary Sitemap Generator

There are tons of XML Sitemap Generators out there on the web already, so why would anyone need another one? Well, one of the points of sitemaps is to give search engines a helping hand when they’re crawling and indexing your website.

Stop to think for a second about how these sitemap generator sites actually generate your sitemap. They crawl your site, and produce a nice looking sitemap file. Okay, but how the heck are these sites going to give any extra information to major search engines like Google who have the most advanced crawlers on earth?

The answer is they’re not, which is where the Perl script below comes in. It’s an extremely basic building block which simply accepts a text list of URLs (from “urls.txt”) and throws out an XML Sitemap file containing those URLs.
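
For illustration only, here is a minimal Python sketch of the same idea (it is not the Perl script referred to above). It assumes one URL per line in “urls.txt”; the output name “sitemap.xml” and the function names are assumptions, and only the required `<loc>` element is written for each URL.

```python
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def read_urls(path):
    """Read one URL per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def build_sitemap(urls):
    """Return an XML Sitemap document listing the given URLs."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        f'<urlset xmlns="{SITEMAP_NS}">',
    ]
    for url in urls:
        # escape() turns &, < and > into XML entities, as the format requires.
        lines.append(f"  <url><loc>{escape(url)}</loc></url>")
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"


def write_sitemap(urls_path="urls.txt", sitemap_path="sitemap.xml"):
    """Convert the URL list at urls_path into a sitemap file at sitemap_path."""
    with open(sitemap_path, "w", encoding="utf-8") as out:
        out.write(build_sitemap(read_urls(urls_path)))
```

Running `write_sitemap()` in a directory containing urls.txt would produce sitemap.xml alongside it.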

So how do you get the list of URLs, and what advantage does it hold over a third party generator? You can get the URLs by using your own spider or an off-the-shelf solution like Sphider; the benefit of this is you can specify crawling rules which exclude pages which you don’t want to be indexed or that have no value for search engines.
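
If your spider has no rule support of its own, a similar effect can be had by filtering the URL list before it goes into urls.txt. A small Python sketch, in which the exclusion patterns and the function name are hypothetical examples rather than recommended rules:

```python
import re

# Hypothetical exclusion rules: internal search results, cart pages, session IDs.
EXCLUDE_PATTERNS = [r"/search\?", r"/cart", r"[?&]sessionid="]


def filter_urls(urls, patterns=EXCLUDE_PATTERNS):
    """Return only the URLs that match none of the exclusion patterns."""
    compiled = [re.compile(pattern) for pattern in patterns]
    return [url for url in urls if not any(rx.search(url) for rx in compiled)]
```

The surviving URLs can then be saved one per line for the sitemap step.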
Continued…

Posted in perl, seo, sitemaps.