Robots.txt, Robots Meta Tag, .htaccess mod_rewrite


There are three commonly supported methods for telling internet indexing spiders, bots and robots what to scan and what to skip. The methods complement each other, but they are not equal in effect.

  1. robots.txt
  2. Robots Meta Tag
  3. .htaccess and mod_rewrite

Summary:

To really enforce rules for any specific user agent that visits your website, you will have to analyze your traffic constantly: website analytics, bandwidth reports, visiting IP addresses and their geographic locations, and known public or private proxy servers. You also have to learn the specific methods and tactics of EVERY unwanted program and visitor, and be able to implement new means to thwart their new methods on a regular basis.
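As a rough illustration of the traffic analysis described above, the following Python sketch tallies requests per IP address and per user agent string. It assumes the server writes the Apache "combined" log format; the regular expression, the function name `tally` and the sample log lines are hypothetical and would need adjusting for a real log file.

```python
import re
from collections import Counter

# Assumed layout: Apache "combined" log format. Adjust for other formats.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def tally(lines):
    """Count requests per client IP and per user agent string."""
    by_ip = Counter()
    by_agent = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        by_ip[match.group("ip")] += 1
        by_agent[match.group("agent")] += 1
    return by_ip, by_agent


# Made-up sample lines, for illustration only.
SAMPLE_LOG = [
    '123.45.6.7 - - [10/Oct/2020:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "ExampleBot/1.0"',
    '123.45.6.7 - - [10/Oct/2020:13:55:37 +0000] "GET /tmp/a.dat HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '123.45.6.8 - - [10/Oct/2020:13:56:01 +0000] "GET /page.php HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    ips, agents = tally(SAMPLE_LOG)
    print(ips.most_common(5))
    print(agents.most_common(5))
```

The busiest addresses and agents in such a tally are candidates for closer inspection, not automatic blocking, since a legitimate proxy or crawler can also generate heavy traffic.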

Block Unwanted Visitors by IP Address or UserAgent in Apache using mod_rewrite

Use .htaccess rules to block unwanted bots, spiders and other UserAgents that don’t fetch, or that fetch and ignore robots.txt.

Blocking visitors by IP address filtering in .htaccess file:

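# Deny specific IP addresses, and allow all others.
# Note: no space is allowed after the comma in "Allow,Deny".
# This is Apache 2.2 syntax; Apache 2.4 uses "Require" directives instead.
Order Allow,Deny
Deny from 123.45.6.7
Deny from 123.45.6.8
Deny from 123.45.6.9
Allow from all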


Block specific UserAgent using mod_rewrite

   # Block the Google Images bot from fetching your copyrighted images.
   # Googlebot-Image also honors a robots.txt group addressed to it;
   # this rule enforces the block on the server side instead.
   RewriteEngine On
   RewriteCond %{HTTP_USER_AGENT} ^Googlebot-Image [NC]
   # Answer every matching request with 403 Forbidden
   RewriteRule ^ - [F,L]


The catch with this method is that “sneaky” program developers can simply masquerade as “normal” visitors by sending a common web browser user agent string. This reinforces the point that all three of these methods are USEFUL, but they are in no way a complete or secure solution, even when all three are used precisely.


Also see:

  1. robots.txt
  2. Robots Meta Tag

Robots Meta Tag


Use a meta tag embedded in the head of a specific page to tell search engine spiders and robots what to index and what to skip:

  1. Pages including “noindex, nofollow” indicate that they are NOT to be indexed, NOT to be included in listings, and that the links on them are NOT to be followed.
  2. Pages including “index, nofollow” indicate that they are to be indexed and listed, but that the links on them are not to be followed.
  3. Pages including “index, follow” indicate that they are to be fully indexed, included in all applicable listings, and that all links on them are to be followed.

DO NOT index, DO NOT include in listings, and DO NOT follow links

<meta name="robots" content="noindex, nofollow" />

Index, include in listings, but DO NOT follow links

<meta name="robots" content="index, nofollow" />

Index, include in listings, and follow links

<meta name="robots" content="index, follow" />
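To check which directives a page actually declares, a small parser can read the robots meta tag back out of the HTML. The sketch below uses Python's standard `html.parser` module; the class name `RobotsMetaParser` and the helper `robots_directives` are illustrative names, and the sketch only looks at tags whose name attribute is "robots" (it ignores crawler-specific names and HTTP headers).

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also arrive here.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        for token in attributes.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of robots directives declared in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, nofollow" /></head></html>'
    found = robots_directives(page)
    print("may index:", "noindex" not in found)
    print("may follow links:", "nofollow" not in found)
```

A page with no robots meta tag yields an empty set, which compliant robots generally treat the same as “index, follow”.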

Also see:

  1. Robots.txt
  2. .htaccess and mod_rewrite

Robots.txt

Robots.txt is a plain text file placed in the root directory of a website. Some search engine spiders and internet robot/bot programs read it as a configuration file that directs them to what you want indexed and what you don’t. Although many robots will read and follow your instructions in the “/robots.txt” file, many ‘less compliant’ programs ignore this file completely.

Here are a few examples of robots.txt file (plain text):

Ask all search engines to NOT index or follow links on the entire website:

#asks all search engines to NOT index and NOT follow any pages or links on the entire website
User-agent: *
Disallow: /

Allows all search engines to index and follow links on the entire website by Disallowing nothing:

#allows search engines to index and follow all pages and links on the entire website by Disallowing nothing
User-agent: *
Disallow:

Disallows specific folders and files from indexing and following:

User-agent: *
Disallow: /uploads/ # since this folder may contain secure, private, cached or temporary files, we should disallow this entire folder from being indexed.
Disallow: /tmp/ # since this folder may contain cached or temporary files, we should disallow this entire folder from being indexed
Disallow: /page.php
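One way to sanity-check rules like these before publishing them is to run them through a robots.txt parser. The sketch below uses Python's standard `urllib.robotparser` module against the third example (with the inline comments left out); the user agent name "ExampleBot" and the test paths are made up for illustration. It only shows how a compliant robot would read the rules, and says nothing about robots that ignore the file.

```python
from urllib.robotparser import RobotFileParser

# The folder-and-file example from above, without the inline comments.
ROBOTS_TXT = """\
User-agent: *
Disallow: /uploads/
Disallow: /tmp/
Disallow: /page.php
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if a compliant robot may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" and these paths are hypothetical.
    for path in ("/index.html", "/uploads/photo.jpg", "/tmp/cache.dat", "/page.php"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, "ExampleBot", path) else "disallowed"
        print(path, verdict)
```

The same helper can be pointed at the first two examples to confirm that “Disallow: /” blocks every path and that an empty “Disallow:” blocks none.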


Also see:

  1. Robots Meta Tag
  2. .htaccess and mod_rewrite

A List of Major and Minor Search Engines with PageRank and Alexa Rank

Search Engine       PageRank  Alexa Rank
Google              10/10     1
Alexa               7/10      N/A
WhatUSeek           7/10      28,871
Scrub The Web       6/10      12,154
Entireweb           6/10      16,913
Acoon.de            6/10      144,419
SearchSight         5/10      15,631
Infotiger           5/10      31,364
Websquash           5/10      38,821
FyberSearch         5/10      40,029
Aesop               5/10      93,220
FeedPlex            5/10      95,852
Walhello            5/10      125,261
Abacho              5/10      174,284
Search Luxembourg   5/10      368,225
Jayde               4/10      13,494
TowerSearch         4/10      42,542
Amfibi              4/10      55,399
Cipinet             4/10      57,721
Claymont            4/10      58,898
Burf                4/10      75,057
Acoon               4/10      147,124
Megaglobe           4/10      152,118
DinoSearch          4/10      194,222
GhetoSearch         4/10      338,592
SentenceSeek        4/10      408,247
Mojeek              4/10
The Lesson Finder   4/10      1,365,580
Search-O-Rama       3/10      33,199
AxxaSearch          3/10      62,737
SearchRamp          3/10      66,091
BigFinder           3/10      151,855
Super.info          3/10      190,141
SurfGopher          3/10      280,391
Myahint             3/10      366,402
Boitho              0/10      598,399
Famhoo              0/10      807,861
