Author Archive

Soliciting opinion: Google Website Optimizer

Hey, Rob here.

We’ve recently been toying with the idea of producing a multivariate testing tool in light of a few problems, technical and otherwise, with Google’s offering. In terms of conversions, it’s money for nothing.

So, does anyone have any comments, suggestions or grievances with the multivariate testing tools out there on the market today? If we can come up with enough support and ideas then it might be a project we’re interested in.

On that note, if we were to do anything like this and if you would like to help beta-test the new service, send me a personal E-mail and I’ll get back to you. We’ve got a window coming up in the next few days so we’ll be starting sometime after then. We’re after people with reasonably-trafficked sites that can track conversions based on visiting a specific page, such as an enquiry form or a checkout.

Thanks!

-Rob

Relevancy: Google vs Yahoo

Look at this:

http://www.google.com/search?&q=wordpress%20theme%20writing&sourceid=firefox
http://search.yahoo.com/search?p=wordpress+theme+writing&ei=UTF-8&fr=

I mean what the fuck is going on there? Clearly Yahoo’s top few results are the ones I’m after. Sort it out Google.

Remind me again why we’re pandering to these people? This wasn’t the first search I ran by the way. After a few different keyword searches in Google (to no avail) I plugged the last one I used into Yahoo and got there straight away.

And you know what? It happens time, and time, and time, and time, and time again. I swear G’s still my default search because they waste so much of my time I don’t have any left to fix my bookmarks.

Yours,

-Pissed Off Rob

PHP closures and a quick Debian tip

Dave’s away and I get to indiscriminately litter his blog with posts, so I just wanted to mention something that got me a bit excited a few days ago.

I subscribe to the DevZone RSS feed so I get the (daily) Zend Weekly Summaries. A few days ago they reported on a conversation about anonymous functions. Now, you should really read TFA to get the full picture, but basically this is how anonymous functions (don’t) work in PHP at the moment:

[php]
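<?php
// create_function() takes the argument list and the body as strings;
// single quotes stop PHP interpolating $int before the function is built.
$arr_plus_one = array_map(create_function('$int', 'return ++$int;'), $arr);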
?>
[/php]

Bollocks, right? Well the proposal is for a new syntax for this which would bring PHP much more in line with modern languages like JavaScript:

[php]
<?php
$arr_plus_one = array_map(function($int) { return ++$int; }, $arr);
?>
[/php]

Much better, yeah? Well the discussion is really more about whether this feature should become full-blown closure support in PHP, rather than just a new anonymous function syntax. I just wanted to put it out there and say that I strongly believe Zend should implement full closure support for PHP 6, even if the scoping rules are dodgy. I know newbies are going to be confused by the scoping rules at first but you don’t need to use closures if you don’t understand them. It will also bring PHP a lot closer to being a modern programming language. As it stands PHP is just a fluffy C with just as dodgy OOP. It’s great and I love it, and don’t get me wrong, I won’t write a web app in anything else; it’s just a bit frustrating writing beautiful JS and Python and then having to go back to PHP :-)
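To show what “full closure support” buys you over a plain anonymous function, here is a minimal sketch. It assumes the `use` syntax that PHP eventually shipped in 5.3, which is not necessarily what the proposal discussed above looked like, and `make_adder()` is a made-up helper name.

```php
<?php
// Hypothetical helper: returns a function that remembers $step
// after make_adder() itself has returned.
function make_adder($step) {
    // `use` copies $step into the closure when the closure is created.
    return function ($int) use ($step) {
        return $int + $step;
    };
}

$arr = array(1, 2, 3);
$arr_plus_one = array_map(make_adder(1), $arr);
print_r($arr_plus_one);
```

The inner function captures a variable from the enclosing scope, which is exactly the bit that makes the scoping rules worth arguing about.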

I mentioned Debian didn’t I? Here’s a quick tip that doesn’t warrant another post: If you administer a bunch of Debian servers, look at the apticron package:

apticron report [Fri, 18 May 2007 06:25:09 +0100]
========================================================================

apticron has detected that some packages need upgrading on:

        ganesh.bronco.co.uk
        [ 192.168.0.18 ]

The following packages are currently pending an upgrade:

        xfree86-common 4.3.0.dfsg.1-14sarge4
        libice6 4.3.0.dfsg.1-14sarge4
        libsm6 4.3.0.dfsg.1-14sarge4
        libxext6 4.3.0.dfsg.1-14sarge4
        libxt6 4.3.0.dfsg.1-14sarge4
        libdps1 4.3.0.dfsg.1-14sarge4
        xlibs-data 4.3.0.dfsg.1-14sarge4
        libx11-6 4.3.0.dfsg.1-14sarge4
        libxmu6 4.3.0.dfsg.1-14sarge4
        libxpm4 4.3.0.dfsg.1-14sarge4
        libxaw7 4.3.0.dfsg.1-14sarge4
        smbfs 3.0.14a-3sarge6
        samba 3.0.14a-3sarge6
        samba-common 3.0.14a-3sarge6
        samba-doc 3.0.14a-3sarge6
        xterm 4.3.0.dfsg.1-14sarge4
        xutils 4.3.0.dfsg.1-14sarge4

Sweet yeah? It also goes on to say exactly what the updates contain, which is great. Get it!

Would anyone like some free backlinks?

Steady, Matt. We’re not selling them so it’s okay, right? Actually I won’t even be providing them. It’s all down to the good folks at PHP.

Some of us might remember the Month of PHP Bugs in March, which I have to say passed without great fanfare. I think it’s probably because it made us all look bad, so the less said about that the better. Anyway I was reviewing today’s server patches (via the magical apticron utility) which reminded me that I should probably review the results of the MOPB. Boy am I glad I did!

Take a look at this little doozy

Basically, it’s an XSS vulnerability in the phpinfo() function which gives unescaped output for all user-submitted arrays in GET, POST and Cookies.

Translation?

Well if anyone has a spare phpinfo() for PHP versions 4.4.3 -> 4.4.6 hanging about, try appending this to its URL:

?f[]=%3Ca%20href%3Dhttp%3A//www.davidnaylor.co.uk/%3EDaveN%20Ownz%20j00%3C/a%3E

Then scroll down to “PHP Variables”. If you have an exploitable version, you should get one, clean, un-condomned backlink. Ain’t that precious? So all you would need to do is to get a bunch of them indexed and you’re happy as Larry. However happy he is.
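The underlying mistake is just output that skips HTML escaping. Here is a minimal sketch of what escaped output looks like, assuming a made-up `render_value()` helper; this is an illustration only, not phpinfo()’s actual source or the actual patch.

```php
<?php
// Hypothetical helper, not phpinfo()'s real implementation:
// flattens a request value to text and escapes it for HTML output.
function render_value($value) {
    if (is_array($value)) {
        $parts = array();
        foreach ($value as $key => $item) {
            // Keys are user-controlled too, so they get escaped as well.
            $parts[] = htmlspecialchars((string) $key, ENT_QUOTES, 'UTF-8')
                . ': ' . render_value($item);
        }
        return 'Array(' . implode(', ', $parts) . ')';
    }
    return htmlspecialchars((string) $value, ENT_QUOTES, 'UTF-8');
}

// Stand-in for $_GET['f'] once the query string above has been decoded.
$f = array('<a href=http://www.davidnaylor.co.uk/>DaveN Ownz j00</a>');
echo render_value($f), "\n";
```

With escaping in place the anchor tag comes out as inert text (`&lt;a href=...`) rather than a live link.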

Now would anyone like 60,600 free backlinks?

PS. For those that don’t get it yet, this post was written by Rob, one of Dave’s programmers. In Vim. Proudly.

Stopping bad robots with honeytraps

Following up on our recent Robots.txt Builder Tool announcement, I want to talk a bit about how to deal with robots that do not follow the Robots Exclusion standard. I’m sure at least some of us are familiar with the tale of Brett Tabke and his open warfare on robots hammering Webmaster World. I’m not going to go into it, but he largely solved his problem with ruthless use of Honeypots/Spider Traps.

The basic premise is this:

  1. Robots follow links
  2. Good robots obey the robots.txt file. We can control these.
  3. Bad robots do not. We want to ban these.
  4. Thus: A bad robot will follow a link to somewhere denied by robots.txt.

Our attack has two distinct sections:

  1. Catch the robots, and…
  2. Kill the robots.

Catching a bad robot

To do this, we’ll be creating hidden links around our site and denying access to their destination with a /robots.txt directive. We will then be storing the IPs of the bad robots for later use.

As usual for my posts on David Naylor we’ll be assuming a Linux, Apache, MySQL and PHP (LAMP) setup. However, the technique is really quite simple and is easily adaptable to your stack of choice.

The Link

Okay, so we need a link on your site which is visible to spiders but not to human visitors. Matt gives a great tutorial on how to do it on his blog. This is technically cloaking but Google says it’s okay so we’re going to plough right ahead.

What we’re going to do is create a link that isn’t visible to humans, but one that a robot would pick up easily. The anchor text should be invisible, but should someone read the source or use a weird browser it should warn the visitor not to click it. After all, if they do they’ll get banned.

I’m not going to give you precise instructions on this because we want to avoid botwriters using heuristics to avoid honey traps. However, here are some tips:

  • Link to a page that indicates it’s a trap without being obvious.
    • Bad: /honeytrap.html, /trap.html, /badrobots.html
    • Good: /avoid.html, /dontclick.html, /bad.html
  • Use anchor text which gives you a fair warning, eg “Clicking on this link will get you banned”, “This link is to trap b@d sp1d3rs and r0b0tz”.
  • Hide your link creatively
    • Remember: It must appear in the source as a regular link. The trick is to hide it afterwards
    • You can do this with:
      • Styles - display: none; perhaps position off the page or underneath something (with z-index)
      • Obfuscation - white-on-white text, 1px shim image
      • JavaScript: I don’t recommend this.
    • Whatever you do, don’t show the link and just tell users not to click it. Have button, will push. Remember, we told Bush not to bomb Iraq.

Remember, your link needs some content inside it otherwise most HTML parsers will skip over it.

The Robots.txt File

This bit is really easy. You need to create a robots.txt file inside the root of your website (that is, the top-level directory) which disallows access to the URL you chose. For example, if I decided my link should point to /badboy.php, my robots.txt file would look like:

[code]
User-agent: *
Disallow: /badboy.php
[/code]

You can even use our Robots.txt Builder Tool to help you with this.

Any well-behaved bots should never access /badboy.php from now on. Make sure you upload your robots.txt file before you implement the next section.

I’m going to refer to our link (eg. /badboy.php) as the spider trap. The rest of this tutorial will refer to /badboy.php but please do not use this yourself.

Storing the IPs in the Spider Trap

Okay so now you want to make your spider trap. Create the page /badboy.php and open it up in your favourite code editor.

Our PHP for this is really simple; we’re just storing some environment variables in a database. I’m going to assume you can go through the rigmarole of connecting to a database and managing XSS attacks properly yourself. We should probably log a bit more than just the IPs of the bots. I also want to store their User-agent and the datetime that they visited:

[php]
<?php
require_once("DB.php");
$db = DB::connect("mysql://user:pass@localhost/database");
if (PEAR::isError($db)) die("Could not connect to database");

// if you don't know what PEAR::DB is I suggest you find out!
// "?" placeholders are quoted and escaped; "!" is inserted as-is, so now() runs as SQL.
$db->query("insert into badrobots set ip=?, useragent=?, datetime=!",
    array($_SERVER['REMOTE_ADDR'], $_SERVER['HTTP_USER_AGENT'], "now()"));

echo "You're nicked, son.";
?>
[/php]

Don’t forget to add an index on that ip column in your table.

Now the bad bots will visit this page and get their IP logged. Hurrah!

Banning the bots

So now we want to actually ban our bad bots. This isn’t actually as simple as it sounds. Basically, we have three options:

  1. Ban the bots with PHP
  2. Ban the bots with mod_access (Allow from.. Deny from..)
  3. Completely ban the bots with firewall rules.

I’m going to discuss option #1 in this tutorial. It’s not the best option but it is easily the simplest. You see, with option #1, our server is still accepting the request and firing up a PHP interpreter before the connection is rejected. We’ve also had to connect to a DB and do a read on it. However, neither of the other options talks to a DB, so you have to add the rules by hand or compile them periodically. Worse, option #3 could end up with you completely unable to access your own server if it goes tits up. However, it is the only option that will protect your server from a monumental hammering.
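If you do fancy option #2, “compiling the rules periodically” could look something like this. It’s only a rough sketch: `build_deny_rules()` is a made-up helper, the output uses the older `Order`/`Deny from` syntax that option #2 refers to, and pulling the IPs out of the badrobots table and writing the file from cron is left to you.

```php
<?php
// Hypothetical helper: turns a list of banned IPs into Order/Deny-style
// access rules for an .htaccess file or a <Directory> block.
function build_deny_rules(array $ips) {
    // Allow everyone by default, then knock out the banned addresses.
    $lines = array('Order Allow,Deny', 'Allow from all');
    foreach (array_unique($ips) as $ip) {
        // Skip anything that is not a valid IP so junk never reaches the config.
        if (filter_var($ip, FILTER_VALIDATE_IP) !== false) {
            $lines[] = 'Deny from ' . $ip;
        }
    }
    return implode("\n", $lines) . "\n";
}

// Example addresses from the documentation range 192.0.2.0/24.
echo build_deny_rules(array('192.0.2.10', '192.0.2.11', 'not-an-ip'));
```

Validating each address before it goes into the file matters here, because a bad line in an Apache config can take the whole site down rather than just one robot.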

Anyway, banning the bots with #1 is dead easy. All you need to do is make sure the following bit of PHP code is executed at the start of every page on your site, as soon after you connect to your database as possible. My DB syntax might be different to yours, but as an experienced website operator I’m sure you can translate, right?

[php]
<?php
// connect to DB, etc
if ($db->getOne("select count(1) from badrobots where ip=?", array($_SERVER['REMOTE_ADDR'])))
    die('You have been banned from this site for poor robot behaviour. If you think this is in error please contact the server administrator.');
?>
[/php]

And there you have it! You might also want to log bad robot accesses but.. I dunno, up to you.

And that brings us to the end of our tutorial. I hope you enjoyed it! All comments, suggestions and errata to the usual place.

Congratulations to Richard Hearne for being the first to suggest how I would better store an IP in a MySQL database. However, he neglected to mention that ip2long returns the IP as a signed int and needs to be converted with sprintf. Johannes suggested my favourite method of using MySQL’s built-in INET_ATON and INET_NTOA functions.
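For the record, the ip2long/sprintf conversion looks roughly like this. `ip_to_unsigned()` is a made-up name for illustration.

```php
<?php
// Hypothetical helper: dotted-quad IPv4 string -> unsigned decimal string.
function ip_to_unsigned($ip) {
    $long = ip2long($ip);
    if ($long === false) {
        return null; // not a valid IPv4 address
    }
    // On 32-bit builds ip2long() can return a negative number for high
    // addresses; %u prints it as the unsigned value instead.
    return sprintf('%u', $long);
}

echo ip_to_unsigned('192.168.0.18'), "\n"; // 3232235538
```

The result is the same number MySQL’s INET_ATON('192.168.0.18') gives you, so either way the column wants to be an INT UNSIGNED.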
