Dangers of Custom Shortened URLs
If you’re not careful, custom shortened URLs can be dangerous – just like any other user-provided content.
As an example I’m going to pick on bit.ly. They’re not the only site to have this problem, but since being chosen as the default URL shortener on Twitter they are probably the highest profile.
http://bit.ly/robots.txt
Where do you think that link goes? I’d expect it to go to bit.ly’s robots.txt file, defining any parts of their site they don’t want to be crawled or banning some crawlers altogether. Instead, it redirects to a post on someone’s blog which, in a weird coincidence, links back to one of our sites.
All the owner of that blog would have to do would be to change his post to look like a normal robots.txt file and he could happily ban Google (or Yahoo, or whoever) from crawling any page on bit.ly. You could probably cause a bit of upset by making a Sitemap entry in there that pointed to your own site… I’m not sure if “Noindex:” works in robots.txt but if so that could also be used for mischief.
How does it even work?
You may think that the custom name ‘robots.txt’ shouldn’t have been allowed, since bit.ly do not allow ‘.’ to appear in custom names. However, they will happily strip out any dots in the link and ignore them, so bit.ly/robots.txt is equivalent to bit.ly/robotstxt. Interestingly, this also shows up a bug somewhere in bit.ly. If you click on both of those links, they should take you to the same blog post…
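To make the equivalence concrete, here is a minimal sketch of the dot stripping described above. The function name normalize_alias is made up for illustration; how bit.ly actually does this internally is not known.

```python
def normalize_alias(alias: str) -> str:
    """Drop every '.' from a requested alias (assumed behaviour, not bit.ly's code)."""
    return alias.replace(".", "")


# Both spellings collapse to the same lookup key.
print(normalize_alias("robots.txt"))
print(normalize_alias("robots.txt") == normalize_alias("robotstxt"))
```

If lookups work like this, anyone who registers ‘robotstxt’ effectively owns /robots.txt as well.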
Bit.ly’s Bug
Sometimes clicking those two links will not take you to the same place.
By comparing the information on the two – robots.txt and robotstxt – you can see that they are actually stored separately in their database.
You may also notice where the first one is supposed to redirect to – I’m clearly not the first one to think of it! As I said before, bit.ly remove dots from the URL before they redirect you – but it seems this doesn’t always happen. By comparing two sets of data you could hazard a guess that around a third of the time, trying to fetch the bit.ly robots.txt will redirect you to the first URL (1) while the rest of the time it will send you to the second (2).
What on earth is going on? My guess is that they are doing something like load balancing over at bit.ly and one of their servers isn’t removing dots in the same manner as the others. It’s redirecting to the one in the database with the dot, which I can only assume was added before they put a check in to prevent it. Their load balancer hits that particular internal server for some percentage of requests. Mind you – that’s just a guess, I’d be interested to hear from bit.ly about it.
Example (1):
GET /robots.txt
Host: bit.ly
HTTP/1.1 301 Moved Permanently
Server: nginx/0.7.42
Date: Thu, 24 Sep 2009 09:21:47 GMT
Content-Type: text/html;charset=utf-8
Connection: keep-alive
Content-Length: 131
Location: http://pentabarf.net/bit.ly-robots.txt
Allow: GET, HEAD, POST
This resource has permanently moved to http://pentabarf.net/bit.ly-robots.txt.
Example (2):
GET /robots.txt
Host: bit.ly
HTTP/1.1 301 Moved
Server: nginx/0.7.42
Date: Thu, 24 Sep 2009 09:21:48 GMT
Content-Type: text/html; charset=utf-8
Connection: keep-alive
Location: http://petercoughlin.com/robotstxt-wordpress-plugin/
MIME-Version: 1.0
Content-Length: 314
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML>
<HEAD>
<TITLE>Moved</TITLE>
</HEAD>
<BODY>
<H2>Moved</H2>
<A HREF="http://petercoughlin.com/robotstxt-wordpress-plugin/">The requested URL has moved here.</A>
<P ALIGN=RIGHT><SMALL><I>AOLserver/4.5.1 on http://127.0.0.1:7200</I></SMALL></P>
</BODY>
</HTML>
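The load-balancing guess can be modelled in a few lines. This is only a simulation of my theory, not of bit.ly’s real setup: the pool of three servers, the round-robin rotation and the names LINKS, resolve and simulate are all assumptions. The two target URLs are the ones from the responses above.

```python
from collections import Counter
from itertools import cycle

# Two separate database rows, as the two info pages suggest.
LINKS = {
    "robots.txt": "http://pentabarf.net/bit.ly-robots.txt",
    "robotstxt": "http://petercoughlin.com/robotstxt-wordpress-plugin/",
}


def resolve(alias, strips_dots):
    """Look up an alias on one backend; only some backends strip dots first."""
    key = alias.replace(".", "") if strips_dots else alias
    return LINKS.get(key)


def simulate(alias, backends, requests):
    """Send requests round-robin across the backends and count redirect targets."""
    rotation = cycle(backends)
    return Counter(resolve(alias, next(rotation)) for _ in range(requests))


# Hypothetical pool: two servers strip dots, one does not.
BACKENDS = [True, True, False]
tally = simulate("robots.txt", BACKENDS, 300)
for target, hits in tally.most_common():
    print(hits, target)
```

With one server in three skipping the dot removal, a third of the simulated requests land on the first URL, which is roughly the split I saw.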
This post mentions “dangers” and I’ve only covered one problem with one site… having said that this post is long and boring enough already so I think I’ll save the others until tomorrow! For now, the moral of the story is: if you’re running a URL shortener, be careful letting people name their own links!
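As a rough illustration of “being careful”, a shortener could check each requested name against a list of reserved paths, comparing them in the same normalised form it uses for lookups. Everything here is an assumption for the sake of the sketch: the reserved list, the 30 character limit, the case-insensitive comparison and the function names.

```python
import re

# Paths that crawlers and servers treat specially (illustrative, not exhaustive).
RESERVED = {"robots.txt", "sitemap.xml", "favicon.ico", ".htaccess"}

# Assumed policy: letters, digits, '_' and '-' only, up to 30 characters.
ALLOWED = re.compile(r"[A-Za-z0-9_-]{1,30}")


def normalize(alias):
    """Apply the same dot stripping used at lookup time, ignoring case."""
    return alias.replace(".", "").lower()


RESERVED_KEYS = {normalize(name) for name in RESERVED}


def is_safe_alias(alias):
    """Reject aliases with illegal characters or that collide with a reserved path."""
    if not ALLOWED.fullmatch(alias):
        return False
    return normalize(alias) not in RESERVED_KEYS


for candidate in ["robots.txt", "robotstxt", "RobotsTxt", "sitemapxml", "my-link"]:
    print(candidate, is_safe_alias(candidate))
```

The important part is that the check and the lookup agree on normalisation. Rejecting ‘robots.txt’ while accepting ‘robotstxt’ is exactly the gap described in this post.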
James
18 Comments
Tim Nash - http://www.timnash.co.uk
That does sort of assume that Google will obey a robots.txt on an alternate domain via the 301. Thinking about our own crawler, we actually test whether a status 200 is returned for the robots.txt file to save time (since the vast majority of sites do not have such files). Since this wouldn’t return a 200, it would simply be ignored. I would assume Google would do something similar.
Still a valid concern: if you have shortener/tos /advertise /user or similar, it would be very easy to generate phishing scams or similar, which perhaps are a bigger threat.
John
Interesting find, but I would agree with Tim above. I think the robots.txt protocol will only be accepted if the file is an actual .txt file located in the root of the domain?
Atul
They seem to have the same problem with .htaccess and sitemap.xml. Sitemap.xml is used by Google/search engines to deep crawl the site.
DaveN
want to place a small wager .. or just wait for tomorrow’s post 😉
Tim Nash - http://www.timnash.co.uk
I would be surprised. Google in particular needs to be ultra efficient in its crawl; to waste time by fully calling what will be empty 404s would be inefficient. So either they have deliberately accounted for 3xx replies, in which case one hopes they have thought about this potential problem and checked that the domain is actually the same place, or, whereas we code our bot to look specifically for a 200 status, they may simply code theirs to ignore a 40x, but um, duh on them if that’s the case. Anyway, I have set up a series of domains and 301 (as well as a couple of others, to see if I can determine which way round the above is) robots.txt files; now guessing you have done the same.
John
A cliff hanger, how exciting! I’ll stay tuned
alex kessinger - http://blog.alexkessinger.net
Totally agree. I think most bit.ly links are found out in the wild, so Google indexes them there. Bit.ly uses a 301 redirect so it’s almost invisible. There are lots of other things to worry about when it comes to URL shorteners, but this is not one of them.
There is a discussion on hacker news about this website.
http://news.ycombinator.com/item?id=841505
SearchCap: The Day In Search, September 24, 2009 - pingback
[…] Dangers of Custom Shortened URLs, David Naylor […]
Kevin - http://pentabarf.net/
Nice find. 😉
James
Kevin: I was hoping you might show up!
Nice idea. Could you just add custom names with dots in them a few months ago?
Looks like they have fixed it as of a short time ago.
Bertil Hatt - http://twocroissants.wordpress.com
At least the robots.txt link seems to have been corrected.
Martyn - http://www.webdesign-gm.co.uk
We have been using Bit.ly for a while now with Twitter. I never realised shortened URLs could be such a problem.
Google obeying external REP requests? • Tim Nash UK SEO Blog - pingback
[…] one of the Bronco team wrote an interesting post on the fact Google Crawler was possibly following 301 to Robots.txt file […]
Sebastian - http://sebastians-pamphlets.com/
“Noindex:” works in robots.txt for Googlebot only.
web designers ireland - http://www.bluestar.ie
food for thought indeed…
As if sloppy social media users ain’t bad enough … search engines support traffic theft - pingback
[…] as you can’t avoid URI shortening, roll your own URI shortener and make sure it can’t get abused. For the sake of our children, do not use or support 3rd party URI shorteners. Deprive the […]
Danko - LogoFoo.com - http://www.logofoo.com/
I think bit.ly programmers have spotted this issue and fixed it. The link http://bit.ly/robots.txt is now serving the right path.. 🙂
Sue James - http://the-gardeners-guide.co.uk
Hi all,
Good article, but I am still unsure about something that puzzles me regarding my site. My site is made up of affiliate links, and affiliate URLs are very long, so I have started shortening them… My question is: will this harm my ranking with Google or not?
Thanks in advance
Sue