Control Search Engine Spiders with robots.txt

Search engine spiders are simple programs that crawl the content of your website and help search engines index it. Their activity plays a major role in determining your search engine rankings. The main tool for controlling search engine spiders on your site is the robots.txt file: a plain text file containing a few simple directives, uploaded to the root of your website.

With a robots.txt file you can direct the spiders to the most important pages on your website and improve your rankings, and you can also prevent them from crawling unwanted folders or less important files. For example, a website may contain pages such as the privacy policy, terms and conditions or the about-us page. These pages do little to improve your search engine ranking, and there is no real benefit in letting the spiders crawl them. In that case you can add a few lines to the robots.txt file to keep the spiders out, as in the sketch below.
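For instance, assuming those pages live at /privacy-policy.html, /terms.html and /about-us.html (hypothetical paths used only for illustration), the relevant lines would look roughly like this; the directives themselves are explained further down.

User-agent: *
Disallow: /privacy-policy.html
Disallow: /terms.html
Disallow: /about-us.html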

A properly written robots.txt file, uploaded to the root of your website, is all you need to control the search engine spiders. If your website has subdomains, you need to create a separate robots.txt file for each subdomain. It is also better to keep separate robots.txt files for the secure (https) and non-secure (http) versions of your pages.
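For instance, assuming a site at example.com with a blog subdomain (example.com is just a placeholder domain), crawlers would look for a separate file at each of these locations:

http://www.example.com/robots.txt
https://www.example.com/robots.txt
http://blog.example.com/robots.txt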

To create a robots.txt file, you just need to save a plain text file under the name robots.txt. The most basic version, which lets every spider crawl the whole site, is given below.

User-agent: *
Disallow:

User-agent names the search engine spider that the rule applies to, and the symbol * means it applies to all spiders. The empty Disallow line blocks nothing, so every spider is allowed to crawl the whole site. If you want the rule to apply only to particular spiders, such as Googlebot (Google), Slurp (Yahoo) or msnbot (Microsoft), you can mention those names instead of the * symbol. If you want to keep the spiders out of a particular folder of your website named “personal”, you can rewrite the code as below.

User-agent: *
Disallow: /personal/
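If, for instance, you wanted this rule to apply only to Google's spider rather than to all of them, you could name it instead of using *:

User-agent: Googlebot
Disallow: /personal/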

The Disallow directive specifies the folder or page on your website that you don’t want the spiders to crawl. If you want to prevent spiders from crawling more than one folder, add a separate Disallow line for each one. For example, if apart from the “personal” folder you also want to keep the spiders out of folders named “archive”, “temp” and “clients”, you can rewrite the code as below.

User-agent: *
Disallow: /personal/
Disallow: /archive/
Disallow: /temp/
Disallow: /clients/
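The same directive works for an individual page as well as for a folder. For example, assuming a hypothetical page named old-offer.html at the root of the site, you could block just that file:

User-agent: *
Disallow: /old-offer.html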

If you want all the spiders to ignore the folders mentioned above, but you want Google’s spider to be allowed to crawl the “personal” folder, the code can be rewritten as follows (the Allow directive is an extension to the original standard, but it is recognized by major crawlers such as Googlebot).

User-agent: *
Disallow: /personal/
Disallow: /archive/
Disallow: /temp/
Disallow: /clients/

User-agent: googlebot
Allow: /personal/

These are just a few of the simple robots.txt rules I wanted to share with you. You can learn more about the robots.txt file and its directives here. I hope this is useful for all readers.
