16 Comments
-
- 2
Interesting find, but I would agree with Tim above. I think the robots.txt protocol will only be accepted if the file is an actual .txt file located in the root of the domain?
- 3
They seem to have the same problem with .htaccess and sitemap.xml. Sitemap.xml is used by Google/search engines to deep crawl the site.
- 4
Want to place a small wager... or just wait for tomorrow's post?

- 5
I would be surprised. Google in particular needs to be ultra-efficient in its crawl, and fully fetching what will be empty 404s would be a waste of time. So either they have deliberately accounted for 3xx replies, in which case one hopes they thought about this potential problem and checked that the domain is actually the same place, or, whereas we code our bot to look specifically for a 200 status, they may simply code theirs to ignore a 40x, but um, duh on them if that's the case. Anyway, I have set up a series of domains and 301 redirected (as well as a couple of others, to see if I can determine which way round the above is) their robots.txt. I'm guessing you have done the same.
- 6
A cliffhanger, how exciting! I’ll stay tuned.
- 7
Totally agree. I think most bit.ly links are found out in the wild, so Google indexes them there. Bit.ly uses a 301 redirect, so it's almost invisible. There are lots of other things to worry about when it comes to URL shorteners, but this is not one of them.
There is a discussion on Hacker News about this website.
- 8
[...] Dangers of Custom Shortened URLs, David Naylor [...]
- 9
Nice find.

- 11
At least the robots.txt link seems to have been corrected.
- 12
We have been using Bit.ly for a while now with Twitter. I never realised shortened URLs could be such a problem.
- 13
[...] one of the Bronco team wrote an interesting post on the fact Google Crawler was possibly following 301 to Robots.txt file [...]
- 14
“Noindex:” works in robots.txt for Googlebot only.
- 15
Food for thought indeed…
- 16
[...] as you can’t avoid URI shortening, roll your own URI shortener and make sure it can’t get abused. For the sake of our children, do not use or support 3rd party URI shorteners. Deprive the [...]






That does sort of assume that Google will obey a robots.txt on an alternate domain via the 301. Thinking about our own crawler, we actually test whether a status 200 is returned for the robots.txt file, to save time (since the vast majority of sites do not have such files). Since the redirect wouldn’t return a 200, it would simply be ignored. I would assume Google would do something similar.
Still a valid concern: if you have shortener/tos, /advertise, /user or similar, it would be very easy to generate phishing scams or similar, which are perhaps a bigger threat.
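
For illustration, the two crawler policies discussed in these comments (accept robots.txt only on a 200, or follow a 3xx only when it stays on the same host) could be sketched roughly as below. This is a minimal sketch under assumptions: the function name `robots_decision`, the return labels, and the example hostnames are all hypothetical, and it does not describe how Googlebot or any real crawler actually behaves.

```python
from urllib.parse import urljoin, urlsplit

# Redirect status codes this sketch treats as "3xx replies" (an assumption).
REDIRECT_CODES = {301, 302, 303, 307, 308}


def robots_decision(requested_url, status, location=None):
    """Decide what a hypothetical crawler does with a robots.txt response.

    Returns "use" (parse the body), "follow" (fetch the redirect target),
    or "ignore" (behave as if no robots.txt exists).
    """
    if status == 200:
        return "use"
    if status in REDIRECT_CODES and location:
        # Resolve a relative Location header against the requested URL.
        target = urljoin(requested_url, location)
        same_host = urlsplit(target).hostname == urlsplit(requested_url).hostname
        # Only follow redirects that stay on the host being crawled, so a
        # shortener cannot point the crawler at another site's robots.txt.
        return "follow" if same_host else "ignore"
    # 404s and anything else: treat as "no robots.txt".
    return "ignore"


if __name__ == "__main__":
    print(robots_decision("http://short.example/robots.txt", 301,
                          "http://other.example/robots.txt"))
```

With a strict policy like this, a shortened URL at `/robots.txt` that redirects to a different domain is simply ignored, which is the outcome the comment above assumes.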