Robots.txt

17.08.07

How do you stop Google indexing your robots.txt?

I don’t want users poking around in the folders I have banned Google from.

Take WebmasterWorld (actually, I’m with Brett on this one).

Brett cloaks his robots.txt:


http://www.webmasterworld.com/robots.txt

But Google has to get the real one:

Google cached copy of WMW robots.txt

How do I stop Google showing my robots.txt to the world?

DaveN

22 Comments

  • 1

    Can you not password-protect it in the .htaccess file at all? Or will that also block the spiders?

    TheShaneDJShow
    http://www.shaned

    17th August 2007 @ 13:23

  • 2

    Dave - why don’t you just cloak? Or are you asking more in the general sense, for less sophisticated webmasters?

    quadszilla
    http://seoblackhat.com

    17th August 2007 @ 13:52

  • 3

    X-Robots-Tag HTTP header?

    JohnMu

    17th August 2007 @ 13:59

  • 4

    @quadszilla Yeah, more for general webmasters. Also, if you mention cloaking to a corporate webmaster they tend to poo their pants a little.

    DaveN

    DaveN

    17th August 2007 @ 14:26

  • 5

    Perhaps disallow the robots.txt file in the robots.txt file, so the robots.txt file isn’t indexed ;-)

    Shimon Sandler
    http://www.shimonsandler.com

    17th August 2007 @ 14:48

  • 6

    It’s not cloaking, it’s personalization!

    JohnMu

    17th August 2007 @ 15:21

  • 7

    The only way to stop Google indexing stuff without exposing yourself is to properly protect the areas you’re hiding.

    That means using “deny from all” in a .htaccess file in library folders, or using some sort of login system (possibly HTTP Basic), or restricting by IP.

    Rob Haswell

    17th August 2007 @ 15:38

  • 8

    If memory serves me correctly, oracle.com does simple cloaking based on user agent. Pretty good example to give to a corporate client, I reckon :)
    Couldn’t you just use the X-Robots-Tag in the header to stop Google indexing the file?

    Richard Hearne
    http://www.redcardinal.ie

    17th August 2007 @ 16:34
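
The X-Robots-Tag idea raised in the comments above could look roughly like this. It is only a sketch: a tiny Python WSGI app that serves /robots.txt with an `X-Robots-Tag: noindex` response header. The app name, the helper `call_app`, and the robots.txt body are all made up for illustration, and the thread does not confirm how Google treats the header on a robots.txt file. Note that it does nothing to hide the file from human visitors.

```python
from wsgiref.util import setup_testing_defaults

# Hypothetical robots.txt body, for illustration only.
ROBOTS_BODY = b"User-agent: *\nDisallow: /private/\n"


def robots_app(environ, start_response):
    """Tiny WSGI app: serve /robots.txt with an X-Robots-Tag header."""
    if environ.get("PATH_INFO") == "/robots.txt":
        headers = [
            ("Content-Type", "text/plain; charset=utf-8"),
            # Asks engines that honour the header not to index this URL.
            ("X-Robots-Tag", "noindex"),
        ]
        start_response("200 OK", headers)
        return [ROBOTS_BODY]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found\n"]


def call_app(path):
    """Invoke the app in-process and return (status, headers dict, body)."""
    environ = {}
    setup_testing_defaults(environ)
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(robots_app(environ, start_response))
    return captured["status"], captured["headers"], body
```

Calling `call_app("/robots.txt")` returns the status, the headers (including `X-Robots-Tag`), and the body without starting a server, which makes it easy to check the header is really being sent.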

  • 9

    Double blind also works.

    Use

    Disallow: /foo

    in the robots.txt file.

    Inside that folder turn directory listings off (in Apache that is Options -Indexes; DirectoryIndex only names the default file) so that none of the filenames can be listed.

    Inside that folder create a folder: /foo/bar/

    Again turn directory listings off for that sub-folder.

    Put the stuff you don’t want to be found inside that sub-folder.

    g1smd

    18th August 2007 @ 13:25

  • 10

    In robots.txt, be more clever about what you disallow and how you define it.

    You have a folder called /scripts/.

    Normally you would put:

    Disallow: /scripts

    in the robots.txt file.

    Instead, call the folder /scripting462782/ or some such number.

    Make sure that is the only folder name that begins with /scr

    In the robots.txt file use

    Disallow: /scr

    That will disallow any URL that begins /scr without exposing the real path name.

    g1smd

    18th August 2007 @ 13:28
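
The short-prefix trick in the comment above relies on Disallow rules being matched as path prefixes. Here is a minimal sketch that checks this with Python's standard-library `urllib.robotparser`. That parser is not Google's, so treat it as an approximation of crawler behaviour; the helper name `is_blocked` and the test paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The short-prefix rule from the comment above.
RULES = ["User-agent: *", "Disallow: /scr"]


def is_blocked(path, agent="Googlebot"):
    """Return True if the rules above disallow `path` for `agent`."""
    parser = RobotFileParser()
    parser.parse(RULES)
    return not parser.can_fetch(agent, path)


if __name__ == "__main__":
    for path in ("/scripting462782/tool.php", "/screenshots/a.png", "/about.html"):
        print(path, "blocked" if is_blocked(path) else "allowed")
```

The second path shows the catch g1smd mentions: every URL beginning with /scr is caught, so nothing else on the site should share that prefix.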

  • 11

    Rob Haswell

    20th August 2007 @ 12:12

  • 12

    So anyone find a solution to this?

    Click Input
    http://www.clickinput.com

    20th August 2007 @ 19:45

  • 13

    Is that agreeing with me, or pointing to a problem?

    g1smd

    20th August 2007 @ 20:54

  • 14

    Just put:

    Options -Indexes

    into your .htaccess file and it will stop folks from listing any directory that doesn’t have an index file in it.

    DWR

    24th August 2007 @ 09:43

  • 15

    That doesn’t keep your robots file from being Googled, but it does solve the problem of folks “poking around in your directories”, so the initial desire is fulfilled.

    DWR
    http://deleted

    24th August 2007 @ 09:44

  • 16

    Use conditionals in the .htaccess: if the user agent is blah blah, serve the file blah blah…

    Maybe you guys can look up the apache syntax?

    Igor The Troll

    29th August 2007 @ 10:19

  • 17

    But a good proxy will bypass it, so if you want to see Brett’s robots.txt just write a proxy to pretend to be Google…

    Guess JohnMu has one to lend you..

    Igor The Troll

    29th August 2007 @ 10:22

  • 18

    Dave, forget about asking JohnMu for a proxy script, I have just learned he is one of them now…

    So all his hacking has finally paid off and he has been rewarded as the new Googler…

    Well, lose some, win some.

    Maybe he will make sure none of our sites are deindexed inadvertently by Google. :)
    But if you are interested in reading a cloaked robots.txt with a proxy, look into curl_init in PHP

    Igor The Troll

    29th August 2007 @ 10:58
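
The comment above points at curl_init in PHP. The same idea in Python's standard library is sketched below as a swapped-in equivalent, not Igor's script: request a URL while sending a Googlebot-style User-Agent header. The function names are made up for illustration. This only gets past cloaking that keys on the user agent string; a site that checks the requesting IP would not be fooled.

```python
import urllib.request

# A widely published Googlebot user agent string.
GOOGLEBOT_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)


def build_request(url, user_agent=GOOGLEBOT_UA):
    """Build a request object that claims to be Googlebot."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Fetch `url` with the spoofed user agent and return the body as text."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Something like `fetch("http://www.webmasterworld.com/robots.txt")` would then show whatever the server hands to clients that announce themselves as Googlebot.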

  • 19

    So what is the verdict? Did anyone test and find a perfect solution to this?

    SeLvesTr
    http://newyorkforum.us

    10th October 2007 @ 10:13

  • 20

    You can disallow your robots.txt in your robots.txt:

    user-agent: *
    disallow: /robots.txt

    That will keep it from getting crawled and will prevent a “cached” link from appearing (if it is shown in the index anyway). Google will still be able to access it. If you want to prevent users from seeing it as well, you can use the bot identification setups available for the major search engine bots and only serve them the real file.

    JohnMu

    5th November 2007 @ 11:23
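
JohnMu doesn’t spell out which bot identification setup he means. One commonly documented approach is a reverse DNS lookup on the requesting IP followed by a forward lookup to confirm it, and that is the assumption behind this sketch. The function name, the injectable lookup parameters, and the hostname suffix list are illustrative choices, and the forward lookup used here only handles IPv4.

```python
import socket

# Hostname suffixes treated as Google crawlers (an assumption of this sketch).
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Reverse-resolve `ip`, check the hostname, then forward-confirm it.

    The two lookup callables can be replaced with fakes for offline testing.
    """
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip).rstrip(".").lower()
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # The hostname must resolve back to the same IP address.
        return ip in forward_lookup(host)
    except OSError:
        # Covers failed DNS lookups (socket.herror, socket.gaierror).
        return False
```

Only requests that pass this check would be served the real robots.txt; everyone else, including a visitor faking the user agent, would get the public version.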

  • 21

    Maybe just use a 301 redirect.

    led display
    http://www.chipshow.com

    31st May 2008 @ 08:06

  • 22

    I don’t know if you are still looking for the answer to this, but I think I know what he does.

    He runs robots.txt as a PHP or ASP script. This is done through the .htaccess file, as seen below:

    SetHandler application/x-httpd-php

    This allows PHP code in the robots.txt file to be executed. He simply has a PHP script check whether the user requesting the file is indeed Googlebot and, if so, display the actual robots.txt data. If not, it pulls the news posts from the database and displays them for users like you to be confused =P

    I hope this helps! Please pay me a visit or send an email if you appreciated this answer.

    Matt

    GreenWithEnvy
    http://www.console-addicts.com/

    5th February 2009 @ 05:06
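
The branching described in the comment above is simple enough to sketch. This is a Python stand-in for the PHP script the commenter guesses at, not Brett’s actual code; the constant names, the function name, and both response bodies are invented for illustration.

```python
# Hypothetical contents, for illustration only.
REAL_ROBOTS = "User-agent: *\nDisallow: /private/\n"
DECOY = "Nothing to see here.\n"


def robots_body_for(user_agent):
    """Return the real rules to Googlebot and a decoy to everyone else."""
    if "googlebot" in (user_agent or "").lower():
        return REAL_ROBOTS
    return DECOY
```

A check on the user agent string alone is easy to spoof, so anything that really needs to stay private should be protected on the server rather than merely hidden.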
