Ever wanted to keep a part of your website from being indexed by the search engines? A simple robots.txt file allows you to keep the search engine spiders from crawling your files. Here’s a quick little guide on setting one up for your website.

Start off by opening up a plain text editor like Notepad. Enter the following lines:

User-agent: *
Disallow: /

Save it as robots.txt and upload it to the root of your website. Most of the time, this is the /public_html/ folder. The example above tells every spider (the * matches all of them) that you don’t want any part of your site crawled. If you want to prevent the spider from crawling specific directories, use the following instead:

User-agent: *
Disallow: /test/

The example above tells the spider not to crawl the folder called ‘test’. Don’t forget the trailing slash ( / ) mark: rules are matched by prefix, so without it, Disallow: /test would also block files like /test.html. If you want to prevent the spider from crawling multiple directories, just enter each one on a new line like this:

User-agent: *
Disallow: /test/
Disallow: /blog/
Disallow: /forum/

Note that each of these rules also prevents the spider from crawling any files or sub-directories inside the directory you specify. So in this last example above, if I had a folder called ‘research’ underneath the folder called ‘test’, then it would not get spidered either.
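
If you want to double-check your rules before uploading them, here is a rough sketch using Python’s built-in urllib.robotparser module. The is_allowed helper and the sample paths are made up for illustration, and real search engine spiders may treat edge cases a little differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# The multi-directory rules from the last example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /test/
Disallow: /blog/
Disallow: /forum/
"""


def is_allowed(robots_txt, path, user_agent="*"):
    """Return True if the rules let user_agent crawl path (hypothetical helper)."""
    parser = RobotFileParser()
    # parse() takes the file contents as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # Sample paths are assumptions, chosen to exercise each kind of rule.
    for path in ("/test/research/notes.html", "/blog/", "/test.html", "/about.html"):
        status = "allowed" if is_allowed(ROBOTS_TXT, path) else "blocked"
        print(path, "->", status)
```

With these rules, the first two paths should come back blocked and the last two allowed. If you change the first rule to Disallow: /test (no trailing slash), /test.html gets blocked as well.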

Lastly, if you are running any programs like a blog or forum that you don’t want spidered, make sure that you turn off their RSS feeds and update ping functions.
