In earlier posts we have looked at length at how to give your website the best possible exposure on Google and the other search engines, and at the best ways to SEO (Search Engine Optimise) your site. It has been a great deal of work on your part to make sure that your website is accessible to Google and its Googlebot, that there are plenty of keywords, plenty of quality links and a sitemap for it to follow. Today, however, we are not making your website more accessible to the Googlebot and the other search engine spiders. Quite the opposite...
Today we will be discussing the unthinkable: how to keep search engine spiders off your website, or restrict them so they can only look at (or index) parts of it. It may feel strange to have done so much SEO work only to hide some or all of it. In this article we will be looking at the anti-sitemap: the robots.txt file (or “Robots Exclusion Standard / Robots Exclusion Protocol” if you are a fan of particularly long phrases...).
GOOD BOTS
The robots.txt file is the opposite of your sitemap: it exists to tell cooperating web spiders which parts of your website they should not visit. It was started in the summer of 1994 by agreement of the members of the robots mailing list because, quite simply, it seemed like a good idea. It was made more popular by AltaVista, and the other big search engines caught on in the following years and started using the robots.txt standard too.
While it may seem that we are hurting ourselves by not letting web crawlers/spiders/robots look at our website in its entirety, this is not the case. There may be pages on your website that, while essential, do not help the SEO of your website. It might be a sales page that does not contain any of your keywords (maybe only: “Click Here To Confirm” or “Enter Your Credit Card Details”), and letting a robot look at those pages can mean a worse ranking on Google (more content; fewer keywords).
The information that you should be restricting using the robots.txt file is information that does not help in any way towards the SEO of your website, but we’ll discuss that again later.
So, let’s create a robots.txt file for your website...
It’s a simple plain text file (.txt), so we can create one using the most basic tools on your home computer. You should note that each domain should have its own robots.txt file, and that includes sub-domains. A separate robots.txt file should be created for each of “yourwebsite.com”, “about.yourwebsite.com” and “waffles.yourwebsite.com”.
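Because the file is looked up per host, the address a crawler requests can be worked out from any page URL. Here is a minimal Python sketch of that idea using only the standard library; the helper name robots_url is made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address that governs page_url (hypothetical helper)."""
    parts = urlsplit(page_url)
    # Keep the scheme and the host (including any sub-domain); replace the path.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://yourwebsite.com/shop/checkout.html"))
print(robots_url("http://about.yourwebsite.com/team/index.html"))
```

The two calls give different addresses, which is why the sub-domain needs a file of its own.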
1) Open up a text editor...
For example: Notepad in Windows; TextEdit in Mac OS X
2) Start writing your robots.txt file...
Writing your robots.txt file is very straightforward. The first thing you do is specify which web crawler/spider/robot the text applies to. This is done using the “User-agent” statement. A “*” is a wildcard and it means EVERYBODY (all cooperating web crawlers/spiders/robots). You then make a “Disallow” statement telling the web crawler/spider/robot where it is not allowed to go.
As a result, the most simple form of the robots.txt file is as follows:
------
User-agent: *
Disallow: /
------
The above robots.txt file entry tells ALL cooperating web crawlers, spiders and robots to avoid ALL of your website. Obviously this is something you are never going to do... You can also do the exact opposite. The below robots.txt entry allows ALL cooperating web crawlers/ spiders/ robots to visit ALL of your website.
------
User-agent: *
Disallow:
------
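If you would like to try these two files before putting anything on your server, Python ships with a small robots.txt parser in its standard library (urllib.robotparser). The sketch below is one way to use it; the helper can_visit and the bot name ExampleBot are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def can_visit(robots_txt, agent, url):
    """Parse robots.txt text and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


BLOCK_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nDisallow:\n"

page = "http://yourwebsite.com/index.html"
print(can_visit(BLOCK_ALL, "ExampleBot", page))  # False
print(can_visit(ALLOW_ALL, "ExampleBot", page))  # True
```

An empty Disallow value blocks nothing, which is why the second file lets everything through.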
Using the robots.txt file you can keep cooperating robots away from specific files too, as in the example below:
------
User-agent: *
Disallow: /directory/file.html
------
Using the robots.txt file you can also tell cooperating web crawlers/spiders/robots to stay away from one or several directories...
------
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
------
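If the list of directories gets long, you may prefer to generate the file rather than type it. This is only a sketch: build_robots_txt is a hypothetical helper, and it writes a single group with one Disallow line per path.

```python
def build_robots_txt(paths, agent="*"):
    """Build the text of a one-group robots.txt file (hypothetical helper)."""
    lines = [f"User-agent: {agent}"]
    # One Disallow line per path, in the order given.
    lines += [f"Disallow: {path}" for path in paths]
    return "\n".join(lines) + "\n"


text = build_robots_txt(["/cgi-bin/", "/images/", "/tmp/", "/private/"])
print(text)
```

The printed text matches the example above line for line and can be saved as robots.txt.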
3) In this way, you can write more specific robots.txt documents...
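In the example below, I want to keep the Googlebot out of my /images/ directory but I also want to keep Yahoo!’s bot (which identifies itself as Slurp) out of the /videos/ directory. In addition I want to keep ALL cooperating bots out of my /cgi/ and /tmp/ directories. As a final stipulation, I also want VodaBot (okay, I made this one up) to stay away from an image file called pointless.jpg which is in my /images/ directory. A bot that finds a group addressed to it by name follows that group only and ignores the “*” group, so the shared /cgi/ and /tmp/ rules are repeated in each named group.
------
User-agent: Googlebot
Disallow: /images/
Disallow: /cgi/
Disallow: /tmp/

User-agent: Slurp
Disallow: /videos/
Disallow: /cgi/
Disallow: /tmp/

User-agent: *
Disallow: /cgi/
Disallow: /tmp/

User-agent: VodaBot
Disallow: /images/pointless.jpg
Disallow: /cgi/
Disallow: /tmp/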
------
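One detail worth checking when you mix named groups with the “*” group: a crawler that finds a group addressed to it by name follows that group only. Python's standard-library parser (urllib.robotparser) behaves this way, so it makes a handy test bench. The file text, the bot name OtherBot and the helper allowed below are assumptions made up for this sketch.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /images/

User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Report whether agent may fetch path on the example site."""
    return parser.can_fetch(agent, "http://yourwebsite.com" + path)


print(allowed("Googlebot", "/images/logo.png"))   # False: its own group blocks it
print(allowed("OtherBot", "/tmp/scratch.html"))   # False: falls back to the * group
print(allowed("Googlebot", "/tmp/scratch.html"))  # True: the * group is ignored
```

The last line is the surprise: to keep a named bot out of /tmp/ as well, the rule has to be repeated inside that bot's own group.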
Finally, you will note that while the fictitious VodaBot cannot access the file pointless.jpg it can access the rest of my /images/ directory ... but what if I wanted it the other way round? What if I wanted the excellently named VodaBot to NOT be able to access anything in the /images/ directory EXCEPT an image file called “meaning-of-life.jpg”? Then I would use an Allow statement in my robots.txt file.
------
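User-agent: VodaBot
Allow: /images/meaning-of-life.jpg
Disallow: /images/
------
Note that Allow was not part of the original 1994 standard, so not every crawler understands it. Googlebot and the other major crawlers apply the most specific (longest) matching rule, so the order of the two lines does not matter to them, but some simpler parsers apply the first rule that matches. Putting the Allow line before the Disallow line works for both.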
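Rule order can matter, depending on the parser. As a sketch, Python's standard-library urllib.robotparser applies the first rule that matches, so for it the Allow line has to come before the Disallow line; crawlers that pick the most specific match are not affected by the order. The helper can_visit is made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def can_visit(robots_txt, agent, path):
    """Parse robots.txt text and report whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, "http://yourwebsite.com" + path)


ALLOW_FIRST = """\
User-agent: VodaBot
Allow: /images/meaning-of-life.jpg
Disallow: /images/
"""

DISALLOW_FIRST = """\
User-agent: VodaBot
Disallow: /images/
Allow: /images/meaning-of-life.jpg
"""

print(can_visit(ALLOW_FIRST, "VodaBot", "/images/meaning-of-life.jpg"))     # True
print(can_visit(ALLOW_FIRST, "VodaBot", "/images/pointless.jpg"))           # False
print(can_visit(DISALLOW_FIRST, "VodaBot", "/images/meaning-of-life.jpg"))  # False
```

With the Disallow line first, this first-match parser never reaches the Allow line, so the exception is lost.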
You should also be careful with the trailing “/”, because rules are matched as prefixes of the URL path. “Disallow: /images/” blocks only the /images/ directory and everything inside it. “Disallow: /images” (without the “/” at the end) blocks every path that begins with “/images”: the /images/ directory, but also /images.html and /images-old/. Leave off the trailing “/” only if you really mean to block all of those.
Have a look at Wikipedia’s robots.txt file ( http://en.wikipedia.org/robots.txt ) as an example. It uses comments (lines starting with the # symbol) to explain how the file works. This is a great resource if you’re writing your first robots.txt file.
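As a check on how the trailing “/” behaves, the sketch below feeds both spellings to Python's standard-library urllib.robotparser, which treats each rule as a prefix of the URL path. The helper blocked_paths, the bot name and the sample paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(rule, paths):
    """Return the paths that a single 'Disallow: <rule>' line blocks."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {rule}"])
    site = "http://yourwebsite.com"
    return [p for p in paths if not parser.can_fetch("ExampleBot", site + p)]


paths = ["/images/photo.jpg", "/images.html", "/images-old/a.png", "/index.html"]

print(blocked_paths("/images/", paths))  # only the file inside the directory
print(blocked_paths("/images", paths))   # everything whose path starts with /images
```

The version without the trailing “/” blocks the directory and also /images.html and /images-old/, so it casts a wider net rather than a narrower one.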
4) Save and upload...
Save your document in plain text format, as robots.txt, making sure that the extension of the text document is .txt. The file you have can be uploaded straight to the root (home) directory of the website it applies to.
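Before uploading, it can be worth scanning the file for misspelled statements, because a crawler simply skips a line it does not recognise (a mistyped “Disallow” protects nothing). The checker below is a rough sketch: find_unknown_fields is a hypothetical helper, and its list of known field names is an assumption (Sitemap and Crawl-delay are not covered in this article and are not honoured by every crawler).

```python
# Field names this sketch accepts; anything else is reported.
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}


def find_unknown_fields(robots_txt):
    """Return (line_number, field) pairs for field names not in KNOWN_FIELDS."""
    problems = []
    for number, line in enumerate(robots_txt.splitlines(), start=1):
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        field = line.split(":", 1)[0].strip()
        if field.lower() not in KNOWN_FIELDS:
            problems.append((number, field))
    return problems


draft = "User-agent: *\nDisallow: /cgi/\nDissallow: /tmp/\n"
print(find_unknown_fields(draft))  # [(3, 'Dissallow')]
```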
BAD BOTS
The robots.txt file is a double-edged sword, however. You will notice that I keep referring to “cooperating” spiders. Many people assume that the robots.txt file can be used to hide parts of their website from the search engines. I cannot stress enough how wrong this is.
For most of its life the robots.txt protocol had no official standards body behind it (it was only written up as an IETF standard, RFC 9309, in 2022), and following it is still voluntary. There are very, very many search engines out there on the Internet and each has its own crawler, spider or robot. Each of these must be programmed to follow the instructions laid out in your robots.txt document. Imagine if a crawler or spider was programmed to visit ONLY the links that the robots.txt told it not to visit. There is nothing to stop it doing this.
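To see how little the file hides, remember that anybody can download it and read it. The sketch below (listed_paths is a hypothetical helper, and the file text is made up) pulls out every Disallow path in a few lines of Python, which is all an uncooperative robot would need to do.

```python
def listed_paths(robots_txt):
    """Collect every non-empty Disallow path, exactly as any visitor could."""
    paths = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


public_file = "User-agent: *\nDisallow: /private/\nDisallow: /tmp/\n"
print(listed_paths(public_file))  # ['/private/', '/tmp/']
```

In other words, listing a directory in robots.txt advertises that it exists; it does not protect it.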
Any parts of your website that you do not want to be visible to anybody should:
(a) Not be uploaded to your website at all
(b) Be password protected
Of these two options, (a) is by far the most effective.
In general the robots.txt file is not there for security in any way. It is there to improve the Search Engine Optimization of your site, making sure all the hard work that you have done SEOing your website is used in the best possible way. It is there to stop Googlebot finding things that would hurt the SEO of your website or are pointless as far as the theme or content of your website goes.
Suggested further reading:
How to make the googlebot love ya!
Google Webmaster Tools 101