Google Sitemaps OMG
- 18th Nov 2005
- Ebay
Google
Another day, another Google privacy violation
Remember the new sitestats section of Google Sitemaps? In a couple of minutes we’ve found quite a dodgy exploit which sometimes allows you to see the stats of your less web-savvy competitors.
Google requires you to verify that a site is yours by placing a file with a random filename in the root of your site. However, if you (badly) employ custom 404 messages on your server, you may have inadvertently instructed your server to declare all URLs within your domain as found.
It all depends on the actual server headers returned and the way Google interprets them. From our little foray, we’ve concluded:
Not Found:
404 (obviously)
301 and 302 Moved Permanently/Temporarily
Found:
200 - all of
All other 302s (When redirecting to, say, /404.html)
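Webmasters can test their own servers for this by requesting a filename that cannot exist and looking at the status code the request finally ends on. What follows is only a minimal sketch in Python using the standard library; the function names and the probe filename pattern are our own invention for illustration, and it says nothing about how Google's verifier actually works.

```python
import secrets
import urllib.error
import urllib.request


def random_probe_path() -> str:
    """Build a path that almost certainly does not exist on the server."""
    return f"/probe-{secrets.token_hex(16)}.html"


def looks_like_soft_404(final_status: int) -> bool:
    """A missing file should end in 404 or 410; a final 2xx claims it exists."""
    return 200 <= final_status < 300


def probe(base_url: str, timeout: float = 10.0) -> int:
    """Request a random path, following redirects, and return the final status."""
    url = base_url.rstrip("/") + random_probe_path()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the code is still what we want.
        return err.code
```

Calling `looks_like_soft_404(probe("http://www.example.com"))` against your own site (example.com is a placeholder) should come back False on a correctly configured server. Because `urlopen` follows redirects, a 302 to a /404.html page that itself answers 200 is reported as a 200, which is exactly the misconfiguration described above.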
Check out the screenshot of sites we own :-)
And some stats from uk.php.net
So who is at fault for all of this?
Well, we reckon mostly Google for not properly thinking through the whole verification process. All the sites we managed to “0wn” would clearly be 404’s to a well-thought-out system. However, webmasters are also partly to blame for bad server setups - e.g., we got Ebay.com because they had misspelled “Permanently” in their 301 header. There are also lots of spammy directories out there which return 200 OK for pretty much anything ending in .html.
But perhaps the real question should be: How much do you want to trust Google with your data when they get caught making mistakes such as this? This kind of data generally isn’t too sensitive, but imagine if we put a competitor’s site in there? At very least we’d be able to know exactly what keywords to target.
So go to Google Sitemaps, add aol.com to your account and see what happens. I bet you get their stats :)
Post Script:
A couple of things we’ve found since playing with this:
- 23% of the Alexa Top 100 sites are susceptible to this problem
- Other big sites include Orkut, Infoseek Japan, Match.com, Business.com and Whitehouse.gov
- SEW ranks for singingfish
- Most sites we added were using 301 or 302 to another file. We noticed if your 302 or 301 crosses a domain, the site was added
- MSN servers return 200 OK but also “STATUS_CODE: NotFound”, which Google fell over on (“temporary problem”)
- Monster.com should be susceptible but Google’s servers couldn’t resolve it.
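The cross-domain redirect case in the list above suggests one simple check a verifier could apply: compare the host that was requested with the host the redirect chain ended on. This is a small sketch of that idea in Python, not a description of what Google does; `same_host` is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def same_host(requested_url: str, final_url: str) -> bool:
    """True when the redirect chain ended on the host originally requested."""
    # urlsplit().hostname is lower-cased, so the comparison ignores case.
    return urlsplit(requested_url).hostname == urlsplit(final_url).hostname
```

A verifier using something like this would treat a verification file that only "exists" on another domain as not found.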
And one final little titbit from the Sitemaps FAQ:
8. What is being done to protect my privacy?
We use the verification process to keep unauthorized users from seeing detailed statistics about your site. Only you can see these details, and only once we verify you own the site. We don’t use the verification file we ask you to create for any purpose other than to make sure you can upload files to the site.









30 Comments
Nice catch Dave.
Sorry for replying so late, I was too busy (ab)using the system.
Unfortunately, it doesn’t give much info on the old stats but as you rightly say it does assist with the keyword research, but not by much.
Google, FIX THIS ASAP :)
Actually, this is NOT an issue on Google’s side: these servers are just not returning the standard 404 result code for bad URLs (or 403, 410). If your server does NOT do that, Google Sitemaps WILL validate automatically. Also, what information do you get in Google Sitemaps statistics that you cannot already get or guess? :-)
Check your server result codes with this: http://gsitecrawler.com/tools/Server-Status.aspx
or read about the 404/200 issue: http://gsitecrawler.com/articles/error-404-200.asp
Cheers
John
Does the file need to stick around? Once it’s been verified and you have access to the stats, can I remove the file? Then would you not get access and I still get stats?
Thomas : the file was never there.
Google Sitemaps only thinks the file is there, that’s all.
DaveN
With the security flaw I was able to see the Sitemap stats for http://www.whitehouse.gov. Most curious of all was the “Top search queries” & “Top search query clicks” for the White House website, which are as below:
Top search queries
1. failure
2. w
3. failure
4. house
5. bush
Top search query clicks
1. failure
2. failure
3. white house
4. abraham lincoln
5. george washington
Google obviously didn’t intend for their verification to be so lax and it obviously needs to be fixed, but people have to be reminded that this is Google’s data, and they can give whoever they want access to it. It’s not a privacy issue, as you don’t own the top keywords data. Google does, and it’s theirs to do what they want with.
Ciaron Nixon ….. WHAT !!!
8. What is being done to protect my Privacy?
We use the verification process to keep unauthorized users from seeing detailed statistics about your site. Only you can see these details, and only once we verify you own the site. We don’t use the verification file we ask you to create for any purpose other than to make sure you can upload files to the site.
Ciaron, you’re missing the point. It is a privacy issue given that Google promised to preserve the privacy of the data as outlined in the post above. You could have been using this system for anything, having really sensitive stuff in it.
just imagine if google had “Remove Page” links in sitemaps..
DaveN
It feels like the Microsoft thing: launch it, then fix it later. This sux. I was worried about what Google wants to learn from our stats, but this appears to be a MUCH bigger problem. Thanks for lookin out for us David.
Verification is just the first step to a real two-way communication between webmasters and search-engines. This is just a bump in the road - wait until you can do more than just see statistics :-). A new league of Google-Hackers will arrive once there are more possibilities after validation.
The privacy policy is of course an issue, but all this information can be gathered simply and freely just by crawling a site and using Google for the rest. Perhaps Google should just change the privacy policy and leave it at that?
so the big question for me is Dave, once verified, can you still get into uk.php.net or aol.com?
This issue has just been fixed. I tried this with one of our own sites that I know returns a 200 and got the following message from Google:
panini: Sorry, we did all this using my personal Google Account, and so I removed all the sites once we’d got screenshots. I kinda like being able to pick up my Gmail…
[laughing] fair enough! DaveN M. thanks for letting me know - one of my sites was open to this and I’d really like to know if it’s still verified on any of my competitors’ accounts (if they were quick, of course)… I know that none of the valid sites I verified have suddenly stopped working or are being asked for verification again… so why should theirs? Bit of a major screw-up for Google, this… maybe they really are a competitor for Microsoft in more ways than we think…
Does this continue to work if you’ve removed the verification file from your server, because then I might imagine that a reverification wouldn’t work?
It’s interesting to note that Google considers uk.PHP.net’s PageRank of 7 to be ‘medium’ and the rest of their pages have a ‘low’ PageRank.
Does anyone need any further proof that PageRank is codswallop??
Ah, you misunderstand. I was not trying to justify Google’s obvious lack of foresight here. I was commenting on the fact that a lot of people (not you DaveN) seem to be jumping on the Google-bashing bandwagon under the impression that somehow they have a right to exclusive access to Google’s click-through data. After all, Google could turn around tomorrow and sell the data straight to the highest bidder (not likely, I know, but it illustrates my point).
I remember a similar thing happened when creating a blog. Better not to allow other parties, even trusted ones, onto your server.
Wouldn’t it have been a terribly simple thing for Google to require specific content *and* a specific filename?
That way, Google could safely follow 3xx redirects (or even handle a 200 from a badly misconfigured server) without incorrectly verifying the site.
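The suggestion in this comment can be sketched in a few lines of Python. This is only an illustration of the commenter's idea under our own assumptions: `new_challenge` and `is_verified` are hypothetical names, and the token and filename formats are made up, not anything Google uses.

```python
import hmac
import secrets
from typing import Tuple


def new_challenge() -> Tuple[str, str]:
    """Issue a (filename, token) pair for one site and one account."""
    filename = f"verify-{secrets.token_hex(8)}.html"
    token = secrets.token_hex(32)
    return filename, token


def is_verified(final_status: int, body: str, token: str) -> bool:
    """Accept only a 200 whose body is exactly the token that was issued."""
    if final_status != 200:
        return False
    # Constant-time comparison; surrounding whitespace in the file is ignored.
    return hmac.compare_digest(body.strip().encode(), token.encode())
```

With a check like this, a server that answers 200 with its custom error page for every URL still fails verification, because the error page does not contain the token.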
Looking at this from a development point of view… that’s really shoddy design on Google’s part… surprisingly bad for a company of Google’s size. Particularly for an enterprise that grew out of almost entirely web-based apps, Google should really know better.
John’s article here is pretty useful as is the simple checking tool within it.
Also (second-hand report, I know), it sounds to me like they knew about it already.
Is this exploit still active or has it been fixed?
Looks fixed; I’m getting the same ‘We’ve detected that your 404 (file not found) error page returns a status of 200 (OK) in the header.’ message as reported above. Damn that sleep thing- looks like I missed out!
Right, I’m sorry to keep banging on about this, but having removed the verification files from sites I own and have verified, they still work.
Surely this means that anyone who has got access to a site they shouldn’t have can still get to it?
ok - they are saying they re-verified
http://sitemaps.blogspot.com/2005/11/site-verification.html
Welcome to my pen testing toolkit :D
Wow. Very interesting. I have not heard anyone else discuss this before. I guess this is old news now though.
“All the sites we managed to “0wn” would clearly be 404’s to a well-thought-out system. However, webmasters are also partly to blame for bad server setups”. That I agree with, as I discovered it the hard way.