Following up on our recent Robots.txt Builder Tool announcement, I want to talk a bit about how to deal with robots that do not follow the Robots Exclusion Standard. I’m sure at least some of us are familiar with the tale of Brett Tabke and his open warfare on robots hammering Webmaster World. I’m not going to go into it, but he largely solved his problem with ruthless use of honeypots/spider traps.
The basic premise is this:
- Robots follow links
- Good robots obey the robots.txt file. We can control these.
- Bad robots do not. We want to ban these.
- Thus: A bad robot will follow a link to somewhere denied by robots.txt.
Our attack has two distinct sections:
- Catch the robots, and…
- Kill the robots.
Catching a bad robot
To do this, we’ll create hidden links around our site and deny access to their destination with a /robots.txt directive. We will then store the IPs of the bad robots for later use.
Okay, so we need a link on your site which is visible to spiders and not to humans. Matt gives a great tutorial on how to do it on his blog. This is technically cloaking, but Google says it’s okay so we’re going to plough right ahead.
What we’re going to do is create a link that isn’t visible to humans, but one that a robot would pick up easily. The anchor text should be invisible, but should someone read the source or use a weird browser, it should warn the visitor not to click it. After all, if they do, they’ll get banned.
I’m not going to give you precise instructions on this because we want to avoid botwriters using heuristics to avoid honey traps. However, here are some tips:
- Link to a page that indicates it’s a trap without being obvious.
  - Bad: /honeytrap.html, /trap.html, /badrobots.html
  - Good: /avoid.html, /dontclick.html, /bad.html
- Use anchor text which gives fair warning, e.g. “Clicking on this link will get you banned”, “This link is to trap b@d sp1d3rs and r0b0tz”.
- Hide your link creatively.
  - Remember: It must appear in the source as a regular link. The trick is to hide it afterwards.
  - You can do this with:
    - Styles – display: none; perhaps position off the page or underneath something (with z-index)
    - Obfuscation – white-on-white text, 1px shim image
- Don’t display the link and tell users not to click it. Have button, will push. Remember, we told Bush not to bomb Iraq.
Remember, your link needs some content inside it, otherwise most HTML parsers will skip over it.
The Robots.txt File
This bit is really easy. You need to create a robots.txt file inside the root of your website (that is, the top-level directory) which disallows access to the URL you chose. For example, if I decided my link should point to /badboy.php, my robots.txt file would look like:

    User-agent: *
    Disallow: /badboy.php

You can even use our Robots.txt Builder Tool to help you with this.
Any well-behaved bots should never access /badboy.php from now on. Make sure you upload your robots.txt file before you implement the next section.
I’m going to refer to our link (e.g. /badboy.php) as the spider trap. The rest of this tutorial will refer to /badboy.php, but please do not use this name yourself.
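Before moving on, it can be worth checking that the robots.txt you uploaded really does deny the spider trap, otherwise you will ban well-behaved bots too. The following is a minimal sketch, not a full robots.txt parser: the function name isDisallowedForAll is made up for this example, and it only looks at Disallow lines in groups that apply to "User-agent: *", ignoring Allow rules and wildcards.

```php
<?php
// Hypothetical helper (not from the original post): returns true if $path is
// covered by a Disallow rule in a "User-agent: *" group of the given robots.txt text.
// Simplification: plain prefix matching only, no Allow rules, no wildcards.
function isDisallowedForAll(string $robotsTxt, string $path): bool
{
    $appliesToAll = false; // does the current group include "User-agent: *"?
    $inAgentLines = false; // are we still reading the group's User-agent lines?

    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        // Strip comments and surrounding whitespace.
        $line = trim(preg_replace('/#.*$/', '', $line));
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }

        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            // A User-agent line that follows a rule line starts a new group.
            if (!$inAgentLines) {
                $appliesToAll = false;
            }
            $inAgentLines = true;
            if ($value === '*') {
                $appliesToAll = true;
            }
        } else {
            $inAgentLines = false;
            if ($field === 'disallow' && $appliesToAll && $value !== ''
                && strpos($path, $value) === 0) {
                return true;
            }
        }
    }

    return false;
}

// Example with the robots.txt from this tutorial.
$robots = "User-agent: *\nDisallow: /badboy.php\n";
echo isDisallowedForAll($robots, '/badboy.php') ? "trap is denied\n" : "trap is NOT denied\n";
```

In practice you would pass in the contents of your live /robots.txt file and the path of your own trap rather than the example values used here.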
Storing the IPs in the Spider Trap
Okay so now you want to make your spider trap. Create the page /badboy.php and open it up in your favourite code editor.
Our PHP for this is really simple: we’re just storing some server variables in a database. I’m going to assume you can go through the rigmarole of connecting to a database and managing XSS attacks properly yourself. We should probably log a bit more than just the IPs of the bots. I also want to store their User-agent and the datetime that they visited:

    <?php
    require_once("DB.php");

    $db = DB::connect("mysql://user:pass@localhost/database");
    if (PEAR::isError($db)) {
        die("Could not connect to database");
    }
    // if you don't know what PEAR::DB is I suggest you find out!

    // "?" placeholders are quoted by PEAR::DB; "!" inserts now() as-is, so MySQL evaluates it.
    $db->query(
        "insert into badrobots set ip=?, useragent=?, datetime=!",
        array($_SERVER['REMOTE_ADDR'], $_SERVER['HTTP_USER_AGENT'], "now()")
    );

    echo "You're nicked, son.";
    ?>

Don’t forget to add an index on that ip column in your table.
Now the bad bots will visit this page and get their IP logged. Hurrah!
Banning the bots
So now we want to actually ban our bad bots. This isn’t actually as simple as it sounds. Basically, we have three options:
- Ban the bots with PHP
- Ban the bots with mod_access (Allow from.. Deny from..)
- Completely ban the bots with firewall rules.
I’m going to discuss option #1 in this tutorial. It’s not the best option but it is easily the simplest. You see, with option #1, our server still accepts the request and fires up a PHP interpreter before the connection is rejected. We also have to connect to a DB and do a read on it. However, neither of the other options interfaces with a DB, so both require adding the rules manually or compiling them periodically. Worse, option #3 could leave you completely unable to access your own server if it goes tits up. However, it is the only option that will protect your server from a monumental hammering.
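As an aside on option #2, "compiling the rules periodically" could look something like the sketch below. It is only an illustration under some assumptions: buildDenyRules is a made-up name, the IPs are example values standing in for rows read from the badrobots table, and the output uses the old Order/Allow/Deny syntax that mod_access understands (newer Apache versions use a different syntax).

```php
<?php
// Hypothetical helper (not from the original post): turn a list of logged IPs
// into mod_access-style rules that could be written to an .htaccess file.
function buildDenyRules(array $ips): string
{
    // Allow everyone by default, then deny the listed addresses.
    $lines = array('Order Allow,Deny', 'Allow from all');

    foreach (array_unique($ips) as $ip) {
        // Skip anything that is not a well-formed IP address.
        if (filter_var($ip, FILTER_VALIDATE_IP) !== false) {
            $lines[] = 'Deny from ' . $ip;
        }
    }

    return implode("\n", $lines) . "\n";
}

// In practice $ips would come from the badrobots table; these are made-up examples.
echo buildDenyRules(array('192.0.2.10', '198.51.100.7', '192.0.2.10', 'not-an-ip'));
```

You would run something like this from a scheduled job and write the result to your configuration, which is exactly the extra moving part that makes option #1 the simpler choice for this tutorial.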
Anyway, banning the bots with #1 is dead easy. All you need to do is make sure the following bit of PHP code is executed at the start of every page on your site, as soon after you connect to your database as possible. My DB syntax might be different to yours, but as an experienced website operator I’m sure you can translate, right?

    <?php
    // connect to DB, etc

    // Refuse to serve the page if this visitor's IP has been logged by the spider trap.
    if ($db->getOne("select count(1) from badrobots where ip=?", array($_SERVER['REMOTE_ADDR']))) {
        die('You have been banned from this site for poor robot behaviour. If you think this is in error please contact the server administrator.');
    }
    ?>

And there you have it! You might also want to log bad robot accesses but.. I dunno, up to you.
And that brings us to the end of our tutorial. I hope you enjoyed it! All comments, suggestions and errata to the usual place.
Congratulations to Richard Hearne for being the first to suggest how I could better store an IP in a MySQL database. However, he neglected to mention that ip2long returns the IP as a signed int and needs to be converted with sprintf. Johannes suggested my favourite method of using MySQL’s built-in INET_ATON and INET_NTOA functions.
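For anyone wanting to try the ip2long approach, here is a minimal sketch of the conversion. The function name ipToUnsigned is made up for this example; ip2long and the %u format of sprintf are standard PHP.

```php
<?php
// Hypothetical helper (not from the original post): convert a dotted IPv4
// address to its unsigned integer form, returned as a string.
function ipToUnsigned(string $ip): string
{
    $long = ip2long($ip);
    if ($long === false) {
        throw new InvalidArgumentException("Not a valid IPv4 address: $ip");
    }

    // On 32-bit builds ip2long() can return a negative number;
    // %u formats the value as an unsigned integer.
    return sprintf('%u', $long);
}

echo ipToUnsigned('192.168.0.1'), "\n"; // prints 3232235521
```

With the MySQL route, the equivalent would be to wrap the value in INET_ATON() when inserting and INET_NTOA() when reading it back, storing the result in an unsigned integer column.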