If you’re not careful, custom shortened URLs can be dangerous – just like any other user-provided content.
As an example I’m going to pick on bit.ly. They’re not the only site to have this problem, but since being chosen as the default URL shortener on Twitter they are probably the highest profile.
http://bit.ly/robots.txt
Where do you think that link goes? I’d expect it to go to bit.ly’s robots.txt file, defining any parts of their site they don’t want to be crawled or banning some crawlers altogether. Instead, it redirects to someone’s blog – who, in a weird coincidence, links back to one of our sites in his post.
All the owner of that blog would have to do is change his post to look like a normal robots.txt file, and he could happily ban Google (or Yahoo, or whoever) from crawling any page on bit.ly. You could probably cause a bit of upset by adding a Sitemap entry in there that pointed to your own site… I’m not sure if “Noindex:” works in robots.txt, but if so that could also be used for mischief.
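To make the risk concrete, here is a minimal sketch of how a well-behaved crawler might interpret such a page, using Python's standard urllib.robotparser. The robots.txt body and the short link http://bit.ly/abc123 are made up for illustration, nothing is fetched from bit.ly, and it assumes the crawler follows the redirect and treats whatever it lands on as the site's robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical content the blog owner could serve in place of his post.
HOSTILE_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def crawler_may_fetch(robots_body: str, user_agent: str, url: str) -> bool:
    """Return True if robots_body allows user_agent to crawl url."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Made-up short link, used only to have something to check.
    print(crawler_may_fetch(HOSTILE_ROBOTS_TXT, "Googlebot", "http://bit.ly/abc123"))
```

With a blanket "Disallow: /" in place, every short link on the domain is off limits to any crawler that honours the file.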
How does it even work?
You may think that the custom name ‘robots.txt’ shouldn’t have been allowed, and indeed bit.ly do not allow ‘.’ to appear in custom names. However, they will happily strip out any dots in the link and ignore them – so bit.ly/robots.txt is equivalent to bit.ly/robotstxt. Interestingly, this also shows up a bug somewhere in bit.ly. If you click on both of those links, they should take you to the same blog post…
Sometimes clicking those two links will not take you to the same place.
By comparing the information on the two – robots.txt and robotstxt – you can see that they are actually stored separately in their database.
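The dot stripping described above can be sketched in a couple of lines. This is only my guess at the behaviour as seen from the outside; strip_dots is a hypothetical name, not anything from bit.ly's code.

```python
def strip_dots(custom_name: str) -> str:
    """Drop every '.' from a custom link name (assumed behaviour)."""
    return custom_name.replace(".", "")


if __name__ == "__main__":
    for name in ("robots.txt", "robotstxt", "r.o.b.o.t.s.t.x.t"):
        print(name, "->", strip_dots(name))
```

Every variant collapses to the same key, which is why anyone who registers ‘robotstxt’ effectively owns ‘robots.txt’ as well.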
You may also notice where the first one is supposed to redirect to – I’m clearly not the first one to think of it! As I said before, bit.ly remove dots from the URL before they redirect you – but it seems this doesn’t always happen. By comparing two sets of data you could hazard a guess that around a third of the time, trying to fetch the bit.ly robots.txt will redirect you to the first URL (1) while the rest of the time it will send you to the second (2).
What on earth is going on? My guess is that they are doing something like load balancing over at bit.ly and one of their servers isn’t removing dots in the same manner as the others. It’s redirecting to the one in the database with the dot, which I can only assume was added before they put a check in to prevent it. Their load balancer hits that particular internal server for some percentage of requests. Mind you – that’s just a guess, I’d be interested to hear from bit.ly about it.
(1)
HTTP/1.1 301 Moved Permanently
Date: Thu, 24 Sep 2009 09:21:47 GMT
Allow: GET, HEAD, POST
This resource has permanently moved to http://pentabarf.net/bit.ly-robots.txt.

(2)
HTTP/1.1 301 Moved
Date: Thu, 24 Sep 2009 09:21:48 GMT
Content-Type: text/html; charset=utf-8
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<A HREF="http://petercoughlin.com/robotstxt-wordpress-plugin/">The requested URL has moved here.</A>
<P ALIGN=RIGHT><SMALL><I>AOLserver/4.5.1 on http://127.0.0.1:7200</I></SMALL></P>
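The load-balancing guess can be illustrated with a toy simulation. Everything structural here is an assumption: a round-robin pool of three backends, one of which skips the dot stripping. The three-server figure is picked only because it reproduces the "around a third" split; the two targets are the ones from the responses above.

```python
from collections import Counter
from itertools import cycle

# The two stored records, with targets taken from the two responses.
LINKS = {
    "robots.txt": "http://pentabarf.net/bit.ly-robots.txt",
    "robotstxt": "http://petercoughlin.com/robotstxt-wordpress-plugin/",
}


def good_backend(name: str) -> str:
    """Strips dots before looking the name up."""
    return LINKS[name.replace(".", "")]


def stale_backend(name: str) -> str:
    """Hypothetical misbehaving server: looks the name up unchanged."""
    return LINKS[name]


def tally(requests: int) -> Counter:
    """Round-robin over three assumed backends, one of them stale."""
    pool = cycle([good_backend, good_backend, stale_backend])
    return Counter(next(pool)("robots.txt") for _ in range(requests))


if __name__ == "__main__":
    for target, hits in tally(300).most_common():
        print(hits, target)
```

A real load balancer need not be round-robin, so the exact ratio proves nothing; the point is only that one inconsistent server behind the pool is enough to produce this kind of flip-flopping.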
This post mentions “dangers” and I’ve only covered one problem with one site… Having said that, this post is long and boring enough already, so I think I’ll save the others until tomorrow! For now, the moral of the story is: if you’re running a URL shortener, be careful about letting people name their own links!
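As a rough idea of what "being careful" might look like, here is a small sketch of a check a shortener could run before accepting a custom name. The reserved list and the allowed character set are illustrative assumptions of mine, not bit.ly's actual rules; the key idea is to compare against the reserved list after normalising the name the same way the redirect code does.

```python
import re

# Illustrative only: a real service would reserve every path it serves itself.
RESERVED = {"robotstxt", "faviconico", "sitemapxml"}

# Assumed character set for custom names once dots have been removed.
ALLOWED = re.compile(r"[A-Za-z0-9_-]+")


def normalise(name: str) -> str:
    """Strip dots and lowercase, so lookalike names compare equal."""
    return name.replace(".", "").lower()


def is_acceptable(name: str) -> bool:
    """Reject empty, oddly spelled, or reserved custom names."""
    key = normalise(name)
    return bool(ALLOWED.fullmatch(key)) and key not in RESERVED


if __name__ == "__main__":
    for candidate in ("robots.txt", "robotstxt", "Robots.TXT", "my-link"):
        print(candidate, is_acceptable(candidate))
```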