Google Sitemaps OMG

Another day, another Google privacy violation

Remember the new sitestats section of Google Sitemaps? In a couple of minutes we’ve found quite a dodgy exploit which sometimes allows you to see the stats of your less web-savvy competitors.

Google requires you to verify that a site is yours by placing a file with a random filename in the root of your site. However, if you (badly) employ custom 404 messages on your server, you may have inadvertently instructed your server to declare all URLs within your domain as found.

It all depends on the actual server headers returned and the way Google interprets them. From our little foray, we’ve concluded:

Not Found:

404 (obviously)
301 and 302 Moved Permanently/Temporarily

Found:

200 – all of
All other 302s (When redirecting to, say, /404.html)
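
If you want to check your own server for this, a rough sketch in Python might look like the following. It asks for a random filename that should not exist and reports what comes back, without following redirects. The names `classify` and `probe`, the labels and the made-up probe path are ours, purely for illustration, and the grouping of status codes is a simplification rather than a description of how Google reads them.

```python
import http.client
import secrets
from urllib.parse import urlsplit


def classify(status):
    """Label the response to a request for a file that should not exist."""
    if status in (404, 410):
        return "ok"          # missing files are reported as missing
    if 300 <= status < 400:
        return "suspect"     # missing files are redirected somewhere else
    if 200 <= status < 300:
        return "vulnerable"  # missing files come back as found
    return "unclear"         # anything else needs a human to look at it


def probe(site):
    """Request a random, almost certainly nonexistent file from `site`.

    http.client does not follow redirects, so we see the first
    status code the server sends rather than the final page.
    """
    parts = urlsplit(site)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        # Hypothetical probe name; any random filename would do.
        conn.request("GET", f"/probe-{secrets.token_hex(8)}.html")
        return classify(conn.getresponse().status)
    finally:
        conn.close()
```

Running `probe("http://www.example.com")` with your own domain in place of the placeholder should give "ok" on a well-behaved server; anything else means your error-page setup deserves a second look.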

Check out the screenshot of sites we own :-)

And some stats from uk.php.net

So who is at fault for all of this?

Well, we reckon mostly Google for not properly thinking through the whole verification process. All the sites we managed to “0wn” would clearly be 404s to a well-thought-out system. However, webmasters are also partly to blame for bad server setups – e.g. we got eBay.com because they had misspelled “Permanently” in their 301 header. There are also lots of spammy directories out there which return 200 OK for pretty much anything ending in .html.

But perhaps the real question should be: how much do you want to trust Google with your data when they get caught making mistakes such as this? This kind of data generally isn’t too sensitive, but imagine if we put a competitor’s site in there. At the very least we’d be able to know exactly what keywords to target.

So go to Google Sitemaps, add aol.com to your account and see what happens. I bet you get their stats :)

Post Script:

A couple of things we’ve found since playing with this:

  • 23% of the Alexa Top 100 sites are susceptible to this problem
  • Other big sites include Orkut, Infoseek Japan, Match.com, Business.com and Whitehouse.gov
  • SEW ranks for singingfish
  • Most sites we added were using a 301 or 302 to another file. We noticed that if the 301 or 302 crossed to another domain, the site was added
  • MSN servers return 200 OK but also “STATUS_CODE: NotFound”, which Google fell over on (“temporary problem”)
  • Monster.com should be susceptible but Google’s servers couldn’t resolve it.
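
Pulling those observations together, a stricter check is not hard to imagine. The sketch below is purely hypothetical: the `Response` type, the `is_verified` function and the token are made up for illustration, and this is not Google's actual code. It refuses redirects, refuses servers that answer 200 for a file that cannot exist, and insists that the verification file contains an expected token.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """Minimal stand-in for an HTTP response (hypothetical type)."""
    status: int
    body: str = ""


def is_verified(file_resp, bogus_resp, expected_token):
    """Decide whether a site passes verification.

    file_resp:  response for the verification file itself
    bogus_resp: response for a random filename that cannot exist
    """
    # A server that says 200 for a nonexistent file proves nothing.
    if bogus_resp.status == 200:
        return False
    # Only a direct 200 counts; 301/302 and everything else fail.
    if file_resp.status != 200:
        return False
    # The file must contain the token, not just exist.
    return file_resp.body.strip() == expected_token
```

With this shape, a custom error page that returns 200 fails on the first test, and a redirect to /404.html fails on the second, whatever the filename happens to be.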

And one final little titbit from the Sitemaps FAQ:

8. What is being done to protect my privacy?

We use the verification process to keep unauthorized users from seeing detailed statistics about your site. Only you can see these details, and only once we verify you own the site. We don’t use the verification file we ask you to create for any purpose other than to make sure you can upload files to the site.

30 Comments

  • Jason Duke 2274 days ago

    http://www.strangelogic.com

    Nice catch Dave.

    Sorry for replying so late, I was too busy (ab)using the system.

    Unfortunately, it doesn’t give much info on the old stats but as you rightly say it does assist with the keyword research, but not by much.

    Google, FIX THIS ASAP :)

  • John 2274 days ago

    http://gsitecrawler.com

    Actually, this is NOT an issue on Google’s side: these servers are just not returning the standard 404 result code for bad URLs (or 403, 410). If your server does NOT do that, Google Sitemaps WILL validate automatically. Also, what information do you get in Google Sitemaps statistics that you cannot already get or guess? :-)

    Check your server result codes with this: http://gsitecrawler.com/tools/Server-Status.aspx
    or read about the 404/200 issue: http://gsitecrawler.com/articles/error-404-200.asp

    Cheers
    John

  • Thomas 2274 days ago

    http://www.twistermc.com

    Does the file need to stick around? Once it’s been verified and you have access to the stats, can I remove the file? Then would you not get access and I still get stats?

  • ZoominZoomout 2274 days ago

    With the security flaw I was able to see the Sitemap stats for http://www.whitehouse.gov. Most curious of all was the “Top search queries” & “Top search query clicks” for the White House website, which are as below:

    Top search queries
    1. failure
    2. w
    3. failure
    4. house
    5. bush
    Top search query clicks
    1. failure
    2. failure
    3. white house
    4. abraham lincoln
    5. george washington
  • DaveN 2274 days ago

    Thomas : the file was never there.

    Google Sitemaps only thinks the file is there, that’s all.

    DaveN

  • Ciaron Nixon 2274 days ago

    Ciaron Nixon

    Google obviously didn’t intend for their verification to be so lax and it obviously needs fixed, but people have to be reminded that this is Google’s data, and they can give whoever they want access to it. It’s not a privacy issue as you don’t own the top keywords data. Google do, and it’s theirs to do what they want with.

  • DaveN 2274 days ago

    Ciaron Nixon ….. WHAT !!!

    8. What is being done to protect my Privacy?

    We use the verification process to keep unauthorized users from seeing detailed statistics about your site. Only you can see these details, and only once we verify you own the site. We don’t use the verification file we ask you to create for any purpose other than to make sure you can upload files to the site.

  • Danny Sullivan 2274 days ago

    http://blog.searchenginewatch.com

    Ciaron, you’re missing the point. It is a privacy issue given that Google promised to preserve the privacy of the data as outlined in the post above. You could have been using this system for anything, having really sensitive stuff in it.

  • DaveN 2274 days ago

    just imagine if google had “Remove Page” links in sitemaps..

    DaveN

  • Aaron Pratt 2274 days ago

    http://www.seobuzzbox.com

    It feels like the Microsoft thing, launch it then fix it later, this sux, I was worried about what Google wants to learn from our stats but this appears to be a MUCH bigger problem. Thanks for lookin out for us David.

  • John 2274 days ago

    http://gsitecrawler.com

    Verification is just the first step to a real two-way communication between webmasters and search-engines. This is just a bump in the road – wait until you can do more than just see statistics :-) . A new league of Google-Hackers will arrive once there are more possibilities after validation.

    The privacy policy is of course an issue, but all this information can be gathered simply and freely just by crawling a site and using Google for the rest. Perhaps Google should just change the privacy policy and leave it at that?

  • panini 2274 days ago

    so the big question for me is Dave, once verified, can you still get into uk.php.net or aol.com?

  • Paul P 2274 days ago

    This issue has just been fixed. I tried this with one of our own sites that I know returns a 200 and got the following message from Google:

    NOT VERIFIED
    We’ve detected that your 404 (file not found) error page returns a status of 200 (OK) in the header.

  • DaveN Minion 2274 days ago

    panini: Sorry, we did all this using my personal Google Account, and so I removed all the sites once we’d got screenshots. I kinda like being able to pick up my Gmail…

  • panini 2274 days ago

    [laughing] fair enough! DaveN M. thanks for letting me know – One of my sites was open to this and I’d really like to know if it’s still verified on any of my competitors accounts (if they were quick of course)… i know that none of the valid sites i verified have suddenly stopped working or are being asked for verification again…. so why should theirs?… bit of a major screw up for google this… maybe they really are a competitor for microsoft in more ways than we think….

  • Elliott Back 2274 days ago

    http://elliottback.com/wp/

    Does this continue to work if you’ve removed the verification file from your server, because then I might imagine that a reverification wouldn’t work?

  • AccuraCast SEO 2274 days ago

    http://www.accuracast.com

    It’s interesting to note that Google considers uk.PHP.net’s PageRank of 7 to be ‘medium’ and the rest of their pages have a ‘low’ PageRank.

    Does anyone need any further proof that PageRank is codswallop??

  • Ciaron Nixon 2274 days ago

    Ah, you misunderstand. I was not trying to justify Google’s obvious lack of foresight here. I was commenting on the fact that a lot of people (not you, DaveN) seem to be jumping on the Google-bashing bandwagon under the impression that somehow they have a right to exclusive access to Google’s click-through data. After all, Google could turn around tomorrow and sell the data straight to the highest bidder (not likely, I know, but it illustrates my point).

  • Orlando 2274 days ago

    Orlando

    I remember a similar thing happened when creating a blog. Better not to allow others, even trusted parties, into your server.

  • Matthew Murphy 2273 days ago

    Wouldn’t it have been a terribly simple thing for Google to require specific content *and* a specific filename?

    That way, Google could safely follow 3xx redirects (or even handle a 200 from a badly misconfigured server) without incorrectly verifying the site.

    Looking at this from a development point of view… that’s really shoddy design on Google’s part… surprisingly bad for a company of Google’s size. Particularly for an enterprise that grew out of almost entirely web-based apps, Google should really know better.

  • panini 2273 days ago

    John’s article here is pretty useful as is his error tool within it.

    also (2nd hand report i know) but sounds like they knew about it already?

  • panini 2273 days ago

    John’s article here is pretty useful as is the simple checking tool within it.

    also (2nd hand report i know) but sounds like they knew about it already to me

  • Purple Cow SEO South Africa 2273 days ago

    http://www.purplecow.co.za

    Is this exploit still active or has it been fixed?

  • Russ 2273 days ago

    http://www.frenzieddaddy.com/

    Looks fixed; I’m getting the same ‘We’ve detected that your 404 (file not found) error page returns a status of 200 (OK) in the header.’ message as reported above. Damn that sleep thing, looks like I missed out!

  • panini 2271 days ago

    Right i’m sorry to keep banging on about this, but having removed the verification files from sites i own and have verified, they still work.

    Surely this means that if anyone has got access to a site they shouldn’t have that they can still get to it?????

  • panini 2271 days ago

    ok – they are saying they re-verified

    http://sitemaps.blogspot.com/2005/11/site-verification.html

  • Techokami 2271 days ago

    Welcome to my pen testing toolkit :D

  • Jeff 1795 days ago

    http://www.adwordanalyzer.com

    Wow. Very interesting. I have not heard anyone else discuss this before. I guess this is old news now though.

  • Web Design Ireland 1776 days ago

    http://www.tophatsolutions.ie

    “All the sites we managed to “0wn” would clearly be 404’s to a well-thought-out system. However, webmasters are also partly to blame for bad server setups” that i agree on as i had discovered that the hard way

