Robots.txt, Robots Meta Tag, .htaccess mod_rewrite

There are three commonly supported methods for telling internet indexing spiders, bots, and robots what to scan and what to skip. The methods complement one another, but they are not equal in effect.

  1. robots.txt
  2. Robots Meta Tag
  3. .htaccess and mod_rewrite
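
To make the first method concrete, here is a minimal sketch of how a well-behaved crawler might interpret robots.txt rules, using Python's standard urllib.robotparser module. The robots.txt content, the bot names (ExampleImageBot, SomeOtherBot), and the paths are all made up for illustration. Keep in mind that robots.txt is only a request: a bot that never performs this check is not stopped by it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: ExampleImageBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text the way a compliant crawler would."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# ExampleImageBot is asked to stay away from the whole site.
print(parser.can_fetch("ExampleImageBot", "/images/photo.jpg"))
# Any other bot may fetch images, but not anything under /private/.
print(parser.can_fetch("SomeOtherBot", "/images/photo.jpg"))
print(parser.can_fetch("SomeOtherBot", "/private/report.html"))
```

The decision is made entirely on the crawler's side, which is why robots.txt cannot enforce anything by itself.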

Summary:

To really protect your website and enforce rules for any specific user agent that visits it, you will have to analyze traffic analytics, bandwidth reports, visiting IP addresses and their geographic locations, and known public or private proxy servers on an ongoing basis. You also need to understand the specific methods and tactics of EVERY unwanted program and visitor, and be able to implement new countermeasures regularly as those methods change.
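
As a rough illustration of the kind of traffic analysis described above, the sketch below counts requests per IP address and per user agent. It assumes log lines in Apache's "combined" format; the sample lines, the BadBot name, and the second IP address are invented for the example, and the simple regular expression will not handle every real-world log line (for instance, requests containing escaped quotes).

```python
import re
from collections import Counter

# Apache "combined" format:
# IP identity user [time] "request" status size "referer" "user agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def summarize(lines):
    """Return (requests per IP, requests per user agent) as Counters."""
    by_ip = Counter()
    by_agent = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        by_ip[match.group("ip")] += 1
        by_agent[match.group("agent")] += 1
    return by_ip, by_agent


# Hypothetical sample data.
SAMPLE = [
    '123.45.6.7 - - [10/Oct/2023:13:55:36 +0000] "GET /a.jpg HTTP/1.1" 200 2326 "-" "BadBot/1.0"',
    '123.45.6.7 - - [10/Oct/2023:13:55:37 +0000] "GET /b.jpg HTTP/1.1" 200 1200 "-" "BadBot/1.0"',
    '98.76.5.4 - - [10/Oct/2023:13:56:01 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

by_ip, by_agent = summarize(SAMPLE)
print(by_ip.most_common(1))     # busiest IP address
print(by_agent.most_common(1))  # most frequent user agent
```

The busiest addresses and agents from a report like this are the natural candidates for the blocking rules that follow.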

Block Unwanted Visitors by IP Address or UserAgent in Apache using mod_rewrite

Use .htaccess rules to block unwanted bots, spiders, and other user agents that either never fetch robots.txt or fetch it and then ignore it.

Blocking visitors by IP address in an .htaccess file:

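# Deny specific IP addresses and allow all others.
# This is Apache 2.2 syntax; Apache 2.4 needs mod_access_compat for it.
# Note: no space is allowed after the comma in "Allow,Deny".
Order Allow,Deny
Allow from all
Deny from 123.45.6.7
Deny from 123.45.6.8
Deny from 123.45.6.9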


Block specific UserAgent using mod_rewrite

   # Block Google Images Bot from Indexing your Copyrighted Images
   # Hopefully someday Google will publish a "supported way" of
   # Disallowing the Google Image Bot when necessary, but until then...
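   RewriteEngine On
   # Match any user agent that starts with "Googlebot-Image" (case-insensitive)
   RewriteCond %{HTTP_USER_AGENT} ^Googlebot-Image [NC]
   # Redirect every matching request away from this site
   RewriteRule ^(.*)$ http://images.google.com/ [R=302,L]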


The catch-22 with this method is that “sneaky” program developers can simply masquerade as “normal” visitors by using common web browser user agent strings. This reinforces the point that all three methods are USEFUL, but even the precise use of all three is in no way a complete or secure solution.
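
The weakness is easy to see in a small sketch. The check below mirrors the ^Googlebot-Image pattern from the RewriteCond above using Python's re module; the is_blocked helper and the browser-like user agent string are hypothetical and only illustrate the idea.

```python
import re

# Same idea as the RewriteCond pattern: the user agent must START with
# "Googlebot-Image" for the rule to apply.
BLOCKED_AGENT = re.compile(r"^Googlebot-Image")


def is_blocked(user_agent: str) -> bool:
    """Return True if a user-agent rule like the one above would match."""
    return BLOCKED_AGENT.search(user_agent) is not None


honest = "Googlebot-Image/1.0"
# Hypothetical: the same program announcing itself as an ordinary browser.
spoofed = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

print(is_blocked(honest))   # the honest bot is caught
print(is_blocked(spoofed))  # the masquerading bot slips through
```

Because the server only sees whatever string the client chooses to send, user agent rules stop honest bots and careless ones, but not determined ones.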


Also see:

  1. robots.txt
  2. Robots Meta Tag