The robots.txt file

  • VodaHost
    General & Forum Administrator

    • Mar 2005
    • 12356

    The robots.txt file

    In earlier posts we have looked at length at how to give your website the best possible exposure on Google and on the other search engines, and at the best ways to SEO (Search Engine Optimise) your site. It has taken a great deal of work on your part to make sure that your website is accessible to Google and its Googlebot, that there are plenty of keywords, plenty of quality links and a sitemap for it to follow. Today, however, we are not making your website more accessible to the Googlebot and the other search engine spiders. Quite the opposite...

    Today we will be discussing the unthinkable: how to keep search engine spiders off your website, or restrict them so they can only look at (or index) parts of it. It may feel strange, after doing so much SEO work, to then hide all or part of it. In this article we will be looking at the anti-sitemap: the robots.txt file (or the “Robots Exclusion Standard / Robots Exclusion Protocol”, if you are a fan of particularly long phrases...).

    GOOD BOTS
    The robots.txt file is the opposite of your sitemap: it exists to tell cooperating web spiders which parts of your website they may not visit. It was started in the summer of 1994 by agreement of the members of the robots mailing list because, quite simply, it seemed like a good idea. It was popularised by AltaVista, and the other big search engines caught on in the following years and started using the robots.txt standard too.

    While it may seem that we are hurting ourselves by not letting web crawlers/ spiders/ robots look at our website in its entirety, this is not actually the case. There may be pages on your website that, while essential, do not help the SEO of your website at all. It might be a sales page that does not contain any of your keywords (perhaps only “Click Here To Confirm” or “Enter Your Credit Card Details”), and letting a robot index those pages can mean a worse ranking on Google (more content, fewer keywords).

    The information that you should be restricting with the robots.txt file is information that does not contribute in any way to the SEO of your website, but we’ll come back to that later.
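
    As a quick preview of the syntax we will build up below (the page and directory names are invented purely for illustration), a robots.txt entry that keeps all cooperating bots away from a hypothetical checkout page and admin area could look like this:

    ------
    User-agent: *
    Disallow: /checkout-confirm.html
    Disallow: /admin/
    ------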

    So, let’s create a robots.txt file for your website...
    It’s a simple plain text file (.txt), so we can create one using the most basic tools on your home computer. You should note that each domain should have its own robots.txt file, and that includes sub-domains. A separate robots.txt file should be created for “yourwebsite.com”, “about.yourwebsite.com” and “waffles.yourwebsite.com”.
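
    To be clear about where those files live (using the example domains above), each one sits at the root of its own domain or sub-domain:

    ------
    http://yourwebsite.com/robots.txt
    http://about.yourwebsite.com/robots.txt
    http://waffles.yourwebsite.com/robots.txt
    ------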

    1) Open up a text editor...
    For example: Notepad in Windows; TextEdit in Mac OS X.

    2) Start writing your robots.txt file...
    Writing your robots.txt file is very straightforward. The first thing you do is specify which web crawler/ spider/ robot the rules apply to. This is done using the “User-agent” statement. A “*” is a wildcard and means EVERYBODY (all cooperating web crawlers/ spiders/ robots). You then make a “Disallow” statement telling the web crawler/ spider/ robot where it is not allowed to go.

    As a result, the simplest form of the robots.txt file is as follows:

    ------
    User-agent: *
    Disallow: /
    ------

    The above robots.txt file entry tells ALL cooperating web crawlers, spiders and robots to avoid ALL of your website. Obviously this is something you are never going to do... You can also do the exact opposite. The below robots.txt entry allows ALL cooperating web crawlers/ spiders/ robots to visit ALL of your website.

    ------
    User-agent: *
    Disallow:
    ------

    Using the robots.txt file you can keep cooperating bots away from specific files too, as in the below example:

    ------
    User-agent: *
    Disallow: /directory/file.html
    ------

    Using the robots.txt file you can also tell cooperating web crawlers/ spiders/ robots to stay away from one or several directories...

    ------
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /images/
    Disallow: /tmp/
    Disallow: /private/
    ------

    3) In this way, you can write more specific robots.txt documents...

    In the below example, I want to keep the Googlebot out of my /images/ directory, but I also want to keep Yahoo!’s bot (its user-agent is “Slurp”) out of the /videos/ directory. In addition, I want to keep ALL cooperating bots out of my /cgi/ and /tmp/ directories. As a final stipulation, I also want VodaBot (okay, I made this one up) to stay away from an image file called pointless.jpg which is in my /images/ directory.

    ------
    User-agent: Googlebot
    Disallow: /images/

    User-agent: Slurp
    Disallow: /videos/

    User-agent: *
    Disallow: /cgi/
    Disallow: /tmp/

    User-agent: VodaBot
    Disallow: /images/pointless.jpg
    ------

    Finally, you will note that while the fictitious VodaBot cannot access the file pointless.jpg, it can access the rest of my /images/ directory... but what if I wanted it the other way round? What if I wanted the excellently named VodaBot NOT to be able to access anything in the /images/ directory EXCEPT an image file called “meaning-of-life.jpg”? Then I would use an Allow statement in my robots.txt file.

    ------
    User-agent: VodaBot
    Disallow: /images/
    Allow: /images/meaning-of-life.jpg
    ------

    Note that an Allow statement MUST come after a Disallow statement.

    You should also be careful when using “/”, as depending on how you use it, it can mean different things. “Disallow: /images/” (with the trailing “/”) blocks only what is inside the /images/ directory, while “Disallow: /images” (without the trailing “/”) blocks anything whose address begins with “/images”: that covers the /images/ directory, but also files such as /images.html or a directory called /images-old/. In other words, leaving off the trailing slash casts a wider net than you might intend, so always double-check which of the two you mean.
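
    For example (just a sketch, with made-up file names to illustrate the prefix matching), this entry blocks only what is inside the /images/ directory:

    ------
    User-agent: *
    Disallow: /images/
    ------

    ...whereas this entry blocks /images/photo.jpg, but also /images.html and a directory called /images-old/:

    ------
    User-agent: *
    Disallow: /images
    ------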

    Have a look at Wikipedia’s robots.txt file ( http://en.wikipedia.org/robots.txt ) as an example. It uses comments (the “#” symbol) to explain how its robots.txt file works. This is a great resource if you’re writing your first robots.txt file.
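
    As a small illustration of that comment syntax (the directory name here is just an example), everything after a “#” on a line is ignored by the crawlers:

    ------
    # Keep all cooperating bots out of the temporary files area
    User-agent: *
    Disallow: /tmp/
    ------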

    4) Save and upload...
    Save your document in plain text format, as robots.txt, making sure that the extension of the text document is .txt. The file can then be uploaded straight to the root (home) directory of the website it applies to.

    BAD BOTS
    The robots.txt file is a double-edged sword, however. You will notice that I keep referring to “cooperating” spiders. Many people assume that the robots.txt file can be used to hide parts of their website from the search engines. I cannot stress enough how wrong this is.

    There is no official standards body for the robots.txt protocol, and there are very, very many search engines out there on the Internet, each with its own crawler/ spider or robot... These must be programmed to follow the instructions laid out in your robots.txt document. Imagine if a crawler or spider was programmed to visit ONLY the links that the robots.txt told it not to visit. There is nothing to stop it doing exactly that.

    Any parts of your website that you do not want to be visible to anybody should:
    (a) Not be uploaded to your website at all
    (b) Be password protected

    Of these two options, (a) is by far the more effective.

    In general the robots.txt file is not there for security in any way. It is there to improve the Search Engine Optimisation of your site and to make sure that all the hard work you have done SEOing your website is put to the best possible use. It is there to stop the Googlebot finding things that would hurt the SEO of your website or that are pointless as far as the theme or content of your website goes.

    Suggested further reading:

    How to make the googlebot love ya!

    Google Webmaster Tools 101


    VodaHost

    Your Website People!
    1-302-283-3777 North America / International
    02036089024 / United Kingdom
    291916438 / Australia

    ------------------------

    Top 3 Best Sellers

    Web Hosting - Unlimited disk space & bandwidth.

    Reseller Hosting - Start your own web hosting business.

    Search Engine & Directory Submission - 300 directories + (Google,Yahoo,Bing)


  • Vasili
    Moderator

    • Mar 2006
    • 14683

    #2
    Re: The robots.txt file

    Excellent ... though it should be updated somewhat for maximum benefit, with the addition of the auto-discovery coding below for the sitemap.xml:

    A complete robots.txt example for XML sitemap autodiscovery (with no 'disallow' parameters), created by adding the "Sitemap" line as shown below:

    User-agent: *
    Allow:
    Allow: (etc. for as many as allowing)
    Sitemap: http://www.yoursitename.com/sitemap.xml


    If you have created a sitemap index file (where you mirror your 'do not follow' choices by manually deleting the page/item entries that were auto-generated by the sitemap generator), you can also reference that by inserting these lines instead of the above:

    User-agent: *
    Disallow: (enter specific files/pages not to be read)
    Sitemap: http://www.yoursitename.com/sitemap-index.xml


    Basically, before you upload your sitemap.xml file, delete the coding that maps the pages you do not want spidered ... that way it "mirrors" the 'disallow' instructions in your robots.txt file through simple omission. Be sure to alter the robots.txt file as shown above by including the "auto-discovery" sitemap line, so that it becomes a 'Rule'!
    . VodaWebs....Luxury Group
    * Success Is Potential Realized *


    • jenvin
      Private

      • Sep 2010
      • 3

      #3
      Re: The robots.txt file

      Hi,
      This is my first post here and I'm not that html brained, but I did understand the above post on the sitemap.
      I have a google sitemap installed on my website, but it won't allow Googlebot-Images access to the images.

      This is what I have in the sitemap for crawler access:
      User-agent: *

      Disallow: /cgi-bin
      Disallow: /admin
      Disallow: /account.php
      Disallow: /advanced_search.php
      Disallow: /checkout_shipping.php
      Disallow: /create_account.php
      Disallow: /login.php
      Disallow: /password_forgotten.php
      Disallow: /shopping_cart.php
      Disallow: /_vti_bin
      Disallow: /_vti_cnf
      Disallow: /_vti_log
      Disallow: /_vti_pvt
      Disallow: /_vti_txt

      User-agent: Googlebot-Image

      Disallow: /

      Should I take out the "Disallow: /" or put "Allow: /images" under the disallow?

      I will be thankful for any replies.

      Jen


      • HalfDime47
        Private

        • Sep 2010
        • 3

        #4
        Re: The robots.txt file

        Vasili, I am having a problem reaching either of the two links in your post. I am using Firefox/3.6.9. Please advise if these are available elsewhere.
        Thanks.


        • Vasili
          Moderator

          • Mar 2006
          • 14683

          #5
          Re: The robots.txt file

          JENVIN
          You cannot have conflicting instructions between the files: the robots.txt file will need to clearly state any disallow, and in this case you must specifically 'rule' that your images are disallowed from being cached.
          Also, after auto-generating your sitemap.xml (I prefer not to use Google's version, as it is geared to the advantage of their overall scheme rather than being purely W3C compliant), you must carefully delete the code "mention" of your image file/page, leaving no gap or spacing in the code and no mention of the file/page in existence: the robots.txt file creates a Rule based on a single stated disallow, but there is no "affirmation" of the resource anywhere else (no clearly noted mention of the file or page, since it has been deleted from the xml sitemap, see?).
          The above was in keeping with the context of the earlier article discussing "hiding" page views, but to answer your question directly, "Yes, in your case you would create a specific 'Agent' mention and a proper 'Allow' Rule, as you show above in your post."
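
          In other words (just a sketch, adapting the directives Jen posted above and assuming the images live under an /images/ directory), the Googlebot-Image section could be changed to something like:

          User-agent: Googlebot-Image
          Disallow: /
          Allow: /images/

          As noted in the first post, the Allow line comes after the Disallow line, so the image bot is kept out of everything except the /images/ directory.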

          HALFDIME
          The links above were SAMPLES (note the word "YourSiteName" in them?)
          Replace "yoursitename" with your domain name ....


          You can generate a compliant robots.txt and a sitemap.xml both at this site.
          . VodaWebs....Luxury Group
          * Success Is Potential Realized *


          • chriscartoons
            Sergeant

            • Jul 2008
            • 30

            #6
            Re: The robots.txt file

            i'm adding it now as we speak hahahahahahahaha!!


            • sunrise2012
              Private

              • Jul 2011
              • 3

              #7
              Re: The robots.txt file

              Hello,

              I'm a total beginner at creating websites, so forgive the dumb questions, please.

              Where should the "robots.txt" information (Disallow, Allow) be placed?
              Somewhere in the html below and on every page in the website? (I have about 35 pages):

              <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <HTML> <HEAD> <TITLE>xxxxx <META HTTP-EQUIV="Pragma" CONTENT="no-cache"> <META Name="Keywords" Content="xxxxx"> <META Name="Description" Content="xxxxx"> <META NAME="ROBOTS" CONTENT="ALL"> <META NAME="revisit-after" CONTENT="10 days"> <META NAME="author" content="xxxxx"> <META NAME="copyright" content="Copyright 1980-2007 by xxxxx. All Rights Reserved."> <META NAME="resource-type" content="document"> <META NAME="distribution" content="global">

              </HEAD>

              Also, if there is another better way to create the above, I would really
              appreciate knowing that.

              Thank you so much!

              L.N.


              • VodaHost
                General & Forum Administrator

                • Mar 2005
                • 12356

                #8
                Re: The robots.txt file

                just pop it into your public_html folder
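
                To spell that out a little (using the placeholder domain from the earlier posts), robots.txt is a single file that sits in your public_html (root) folder rather than something you paste into the HTML of each page. Once uploaded, it should be reachable at:

                http://www.yourwebsite.com/robots.txt

                The <META NAME="ROBOTS" ...> tag in your page HEAD is a separate, per-page mechanism; the Disallow/Allow lines discussed in this thread belong only in the robots.txt file itself.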




                • flexworth
                  Sergeant

                  • Jan 2010
                  • 24

                  #9
                  Re: The robots.txt file

                  I don't have any pages I wanted to 'disallow', but I did notice an increase in organic traffic when I uploaded the blank robots.txt file to my root folder.
                  Jason Stallworth
                  TheMuscleProgram.com (main/first site)
                  http://www.themuscleprogram.com/

                  My product site:
                  http://www.hardcoremusclebuildingprogram.com/

                  http://www.jasonstallworth.com

                  Working on a metal/guitar site...coming soon!
