If you crawl websites with Perl scripts, one of the more useful things you can do is route your script’s requests through multiple IP addresses.
This is pretty simple once you know how, but it doesn’t seem to be well documented on the web. The following steps should work if you’re running an Apache server:
1. Configure Apache
Make sure you’re running the mod_proxy module. Then add the following code to your Apache conf file:
ProxyRequests On
<Proxy *>
Order deny,allow
Deny from all
Allow from internal.example.com
</Proxy>
2. Install Squid
Squid is a dual-purpose caching and proxying program; it can be installed on RHEL 5 by following these instructions.
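On the client side, sending a crawler’s requests through more than one proxy can be sketched roughly as below. This is a Python illustration rather than the Perl the post refers to, and the proxy addresses, the `make_opener`, `rotating_proxies` and `fetch` names, and the round-robin strategy are all assumptions made for the example, not part of the original setup.

```python
import itertools
import urllib.request

# Hypothetical proxy endpoints (documentation-range IPs); replace with the
# host:port pairs your own Apache/Squid proxies listen on.
PROXIES = ["http://192.0.2.10:8080", "http://192.0.2.11:8080"]


def make_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS requests via proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def rotating_proxies(proxies):
    """Return an endless round-robin iterator of (proxy_url, opener) pairs."""
    return itertools.cycle([(p, make_opener(p)) for p in proxies])


def fetch(url, rotation, timeout=10):
    """Fetch url through the next proxy in the rotation.

    Returns the proxy used and the raw response body.
    """
    proxy_url, opener = next(rotation)
    with opener.open(url, timeout=timeout) as response:
        return proxy_url, response.read()
```

A crawler would create the rotation once with `rotation = rotating_proxies(PROXIES)` and then call `fetch(url, rotation)` for each page, so consecutive requests leave through different addresses.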
Continued…
Posted in apache, perl, squid.
Just found a “Page not found” error on the RNIB website – why don’t more 404 pages have this “report a broken link” feature? It shows users you care and gives developers useful information:

Oh, and finding and fixing these broken links quickly, as a matter of process, probably wouldn’t hurt your link profile either.
Posted in accessibility, usability.
As the consistently excellent Mad Men kicks off its fourth series in the US, I thought it would be a good time to reflect on a few advertising rules from the gurus at Sterling Cooper that can equally be applied to SEO.
Rule #5: “If you don’t like what’s being said, change the conversation”
Reputation management can be a key part of an SEO’s job, especially as clients learn that Google is their new home page. So how do you deal with negative search results appearing for your brand name?
- Change the conversation – Dell are a fantastic example of this; after the “Dell Hell” episode, Dell turned around their customer support, and with the IdeaStorm website actively showed their customers they were listening. There aren’t a lot of large companies that have the ability or will to turn that kind of thing around.
In SEO there are also a few other bits we can do to address reputation issues:
Posted in seo.
Google’s Product search is lumped into organic referrals in Google Analytics if you don’t use tracking parameters, which can distort the level of true natural search volume you see in any web analytics package.
A typical solution shown on some websites is to add ‘utm_’ tracking variables such as http://example.com/?utm_source=google&utm_medium=products to the URL, which correctly identifies the traffic as being from Google Product search (whether it’s the actual shopping search or a Google onebox). However this solution sacrifices the referring keyword data, and just shows the site as a referring website.
The simple solution to this is to change the utm_medium parameter to ‘organic’ (and the utm_source to ‘googlebase’ or similar), which Google Analytics then treats as a normal organic search engine and accordingly reports the referring keyword.
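As a rough illustration, product feed URLs could be tagged this way with a small helper. The sketch is in Python; the function name `tag_url` and the choice to overwrite any existing utm_source and utm_medium values are assumptions made for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def tag_url(url, source="googlebase", medium="organic"):
    """Return url with utm_source and utm_medium set to the given values.

    Existing utm_source/utm_medium parameters are dropped first; all other
    query parameters are kept in their original order.
    """
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ("utm_source", "utm_medium")
    ]
    query += [("utm_source", source), ("utm_medium", medium)]
    return urlunsplit(parts._replace(query=urlencode(query)))


print(tag_url("http://example.com/"))
# prints http://example.com/?utm_source=googlebase&utm_medium=organic
```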
Update: Tripleox correctly pointed out on Twitter that you can still get keyword data with the first method by going to Traffic Sources -> google / products and using the dimensions drop-down menu to select keywords.
Posted in analytics.
There are tons of XML Sitemap Generators out there on the web already, so why would anyone need another one? Well, one of the points of sitemaps is to give search engines a helping hand when they’re crawling and indexing your website.
Stop to think for a second about how these sitemap generator sites actually generate your sitemap. They crawl your site, and produce a nice looking sitemap file. Okay, but how the heck are these sites going to give any extra information to major search engines like Google who have the most advanced crawlers on earth?
The answer is they’re not, which is where the Perl script below comes in. It’s an extremely basic building block which simply accepts a text list of URLs (from “urls.txt”) and throws out an XML Sitemap file containing those URLs.
So how do you get the list of URLs, and what advantage does it hold over a third party generator? You can get the URLs by using your own spider or an off-the-shelf solution like Sphider; the benefit of this is you can specify crawling rules which exclude pages which you don’t want to be indexed or that have no value for search engines.
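The author’s Perl script is not reproduced in this excerpt. The following is a separate Python sketch of the same idea, turning a plain list of URLs into a minimal XML Sitemap; the function names and the `sitemap.xml` output filename are assumptions, and only the `urls.txt` input name comes from the post.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Build a <urlset> document from an iterable of URL strings.

    Blank lines are skipped; special characters such as & are escaped
    by ElementTree when the tree is serialised.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        url = url.strip()
        if not url:
            continue
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


def write_sitemap(in_path="urls.txt", out_path="sitemap.xml"):
    """Read one URL per line from in_path and write the sitemap to out_path."""
    with open(in_path, encoding="utf-8") as source:
        body = build_sitemap(source)
    with open(out_path, "w", encoding="utf-8") as target:
        target.write('<?xml version="1.0" encoding="UTF-8"?>\n' + body + "\n")
```

Calling `write_sitemap()` in a directory that contains `urls.txt` writes `sitemap.xml` alongside it. Any crawl rules, such as excluding pages with no search value, would be applied when producing the URL list rather than in this step.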
Continued…
Posted in perl, seo, sitemaps.