La Bitacora

To learn about Search Engine Optimization (SEO), internet marketing, website promotion and, of course, to have fun while we are learning

La Bitacora New Challenge

Submitted by masoha on Mon, 12/22/2008 - 04:16.

New Year project changes

December 21, 2008

It has been two months since my last performance report. Due to several external factors, La Bitacora project and all attached domains are almost the same as they were on October 10, 2008 (the last report), when I updated everything; no new content or SEO processes have been applied since then.


Robots.txt - Search Engine Optimization - First Step

Submitted by masoha on Tue, 10/14/2008 - 22:09.

Robots.txt - First SEO Step

October 12, 2008

 

So, what is all that buzz about "robots.txt"?

Let's start with the basics: robots.txt is a special file that handles a website's first interaction with robots and spiders. It instructs non-human visitors on how to access the site and which files or folders are not available for crawling, and it also determines whether a given robot is welcome at all.

Even though robots.txt is a de facto standard, it is not owned by any person, association or standards body, so there is no guarantee that all robots, bots, spiders and other automated agents will follow its rules. Nonetheless, it is common practice for robot authors to make their robots obey, at least, the Standard for Robot Exclusion.

In general, a "good robot" will follow the exclusion rules, while a "bad bot" will ignore them or follow them only partially.

The real importance of robots.txt is that it helps good robots index the pages you want indexed while keeping them away from the pages you want to keep out of search results.
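To make the idea concrete, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching a page. It uses Python's standard urllib.robotparser module; the rules, the bot name "ExampleBot" and the page URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, used only for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "ExampleBot", "http://labitacora.org/index.html"))          # True
print(is_allowed(ROBOTS_TXT, "ExampleBot", "http://labitacora.org/private/notes.html"))  # False
```

A good robot runs a check like this for every URL it is about to request; a bad bot simply skips it.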

Security Issues

A security issue with robots.txt is that by marking a folder for exclusion you are actually announcing that the folder exists, so human visitors and bad bots will be drawn to your private files and can guess their URLs. So NEVER try to hide information on your website using the robots.txt file. Instead, protect your private files and directories with passwords to get better control over who accesses them or, better yet, configure your server to require user authentication before any private information is displayed.
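The sketch below shows how little effort it takes to turn a robots.txt file into a list of "interesting" places to look. The file contents and folder names are hypothetical, and the parsing is deliberately simplistic.

```python
# Hypothetical robots.txt that tries to "hide" two folders.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /admin-backup/   # please do not look here
"""


def disallowed_paths(robots_txt):
    """Collect every non-empty Disallow path, exactly as any visitor could."""
    paths = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


print(disallowed_paths(ROBOTS_TXT))  # ['/private/', '/admin-backup/']
```

Anything that shows up in this list needs real access control on the server, because the list itself is public.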

Google and robots.txt

Google's Webmaster Guidelines suggest the use of robots.txt on several pages. The following paragraph was taken from the Google guidelines pages:

"...Make use of the robots.txt file on your web server. This file tells crawlers which directories can or cannot be crawled. Make sure it's current for your site so that you don't accidentally block the Googlebot crawler..."

Since Google's policy on duplicate content within a site is very strict, they recommend configuring robots.txt to prevent Googlebot from crawling and indexing duplicated pages.

Good Bots - BadBots

While good bots visit your site to discover new information and pages to feed search engine results, bad bots not only consume bandwidth but also steal website content, harvest email addresses and (worst of all) probe for website security weaknesses.

A crawler that follows robots.txt directives is considered a good bot. However, this is not always true, because many bad bots fake good behavior, so a webmaster should combine robots.txt directives with the .htaccess file and traffic logs to control bad crawlers and keep them away.
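One way to use traffic logs for this is to look for agents that request paths robots.txt told them to avoid. The following is only a sketch: the log entries are hypothetical and already reduced to (user agent, path) pairs, while a real access log would need to be parsed first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules and log entries, used only for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

LOG_ENTRIES = [
    ("GoodBot/1.0", "/index.html"),
    ("SneakyBot/2.1", "/private/report.html"),
    ("GoodBot/1.0", "/about.html"),
    ("SneakyBot/2.1", "/index.html"),
]


def find_rule_breakers(robots_txt, entries):
    """Return the user agents that requested at least one disallowed path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    offenders = set()
    for user_agent, path in entries:
        if not parser.can_fetch(user_agent, path):
            offenders.add(user_agent)
    return sorted(offenders)


print(find_rule_breakers(ROBOTS_TXT, LOG_ENTRIES))  # ['SneakyBot/2.1']
```

Human visitors using a browser would be flagged too, so the result is a list of candidates to review before blocking anything in .htaccess, not a verdict.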

I'm building a list of bad bots that crawl the domains in our project. I will add crawlers that hit our other domains and, to make the list more useful, I will also add known bad bots even if they haven't visited our sites yet. The list will be available in a couple of days; check our blog for updates.

Robots Exclusion Standard

  1. Name: the file must be named robots.txt, all letters lower case.
  2. Location: robots.txt MUST BE in the site root; any other location will be ignored by crawlers. E.g., http://labitacora.org/robots.txt is the right URL while http://labitacora.org/site/robots.txt is not.
  3. Format: robots.txt MUST BE created and edited using a plain text editor; formatted (rich) text is not allowed.
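The location rule means a crawler can work out where robots.txt lives from any page URL on the site. A small sketch using Python's standard urllib.parse module (the function name robots_txt_url is just an illustration):

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Build the only URL where a crawler will look for robots.txt."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://labitacora.org/site/page.html"))
# http://labitacora.org/robots.txt
```

Whatever folder the page is in, the result always points at the site root, which is why a file at /site/robots.txt is never read.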

Robots.txt directives are simple. The original standard has only two:

  • User-agent: | Format: User-agent:<space><agent name or wildcard (*)>. Examples: User-agent: Googlebot | User-agent: *
  • Disallow: | Format: Disallow:<space><path or file; a single / matches the whole site>. Examples: Disallow: /private/ | Disallow: /directory/file.htm

The symbol # marks a comment. It can be used at the beginning of a line or after a directive, and all characters after the # up to the end of the line are ignored.
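To show how these pieces fit together, here is a simplified sketch of a reader that strips comments and groups User-agent and Disallow lines into records. It is not a complete parser: it ignores every other directive and does not treat blank lines as record separators, and the sample file is made up.

```python
# Hypothetical file used only for this illustration.
ROBOTS_TXT = """\
# Keep crawlers out of the private area
User-agent: Googlebot
Disallow: /private/   # trailing comment

User-agent: *
Disallow: /
"""


def parse_records(robots_txt):
    """Return a list of (user_agents, disallowed_paths) records."""
    records = []
    agents, paths = [], []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # everything after # is ignored
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if paths:  # rules were already seen, so a new record starts here
                records.append((agents, paths))
                agents, paths = [], []
            agents.append(value)
        elif field == "disallow" and agents:
            paths.append(value)
    if agents:
        records.append((agents, paths))
    return records


print(parse_records(ROBOTS_TXT))
# [(['Googlebot'], ['/private/']), (['*'], ['/'])]
```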

Non Standard Directives

Some directives are supported only by some of the big search engines and crawlers:

  • Crawl-delay: Indicates how many seconds the crawler must wait between requests. Example: Crawl-delay: 10
  • Allow: <path>. This directive helps fine-tune the Disallow directive; in most cases it is used to let crawlers access files or subdirectories inside a directory that is blocked by a Disallow directive.
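Python's standard urllib.robotparser module understands both of these directives, so it can serve as a quick sketch of how a supporting crawler might read them. The rules and the bot name are hypothetical. Note that this particular parser applies the first rule that matches, so the Allow line is placed before the broader Disallow line; other crawlers may resolve conflicts differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, used only for this illustration.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Allow: /private/press/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

delay = parser.crawl_delay("ExampleBot")
press_ok = parser.can_fetch("ExampleBot", "/private/press/release.html")
notes_ok = parser.can_fetch("ExampleBot", "/private/notes.html")

print(delay)     # 10
print(press_ok)  # True, the Allow rule matches first
print(notes_ok)  # False, only the Disallow rule matches
```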

Extended Standard

An extended standard has been proposed. It includes several new directives, some of which are already supported by some crawlers:

  • Request-rate: | Format: Request-rate:<space><pages>/<seconds>. It indicates how many pages can be crawled in a time interval given in seconds. | Example: Request-rate: 2/10 (two pages can be crawled every ten seconds).
  • Visit-time: | Format: Visit-time:<space><start>-<finish>. | Example: Visit-time: 0700-0900 (visit only between 7:00 AM and 9:00 AM). It is very useful for balancing traffic on popular websites.
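As a rough sketch of how a crawler could honor these two directives: Python's urllib.robotparser reads Request-rate directly, but it has no support for Visit-time, so the helper functions below (parse_visit_time and visit_allowed) are hypothetical and hand-written. They assume the window does not cross midnight.

```python
from datetime import time
from urllib.robotparser import RobotFileParser

# Hypothetical rules, used only for this illustration.
ROBOTS_TXT = """\
User-agent: *
Request-rate: 2/10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())
rate = parser.request_rate("ExampleBot")
print(rate.requests, rate.seconds)  # 2 10


def parse_visit_time(value):
    """Turn '0700-0900' (or '700-900') into a pair of datetime.time objects."""
    def to_time(digits):
        digits = digits.strip()
        return time(int(digits[:-2]), int(digits[-2:]))

    start, finish = value.split("-")
    return to_time(start), to_time(finish)


def visit_allowed(value, now):
    """Return True if the time 'now' falls inside the Visit-time window."""
    start, finish = parse_visit_time(value)
    return start <= now <= finish


print(visit_allowed("0700-0900", time(8, 30)))  # True
print(visit_allowed("0700-0900", time(12, 0)))  # False
```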

Examples:

All robots allowed

User-agent: *
Disallow:

All robots are out, no crawlers allowed

User-agent: *
Disallow: /

Only Googlebot is allowed

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
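A quick way to check an "only one bot allowed" setup is to feed it to Python's urllib.robotparser. This sketch assumes the catch-all record uses Disallow: / so that every agent other than Googlebot is shut out; "OtherBot" is a made-up name.

```python
from urllib.robotparser import RobotFileParser

# An empty Disallow for Googlebot, a full Disallow for everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

googlebot_ok = parser.can_fetch("Googlebot", "/index.html")
other_ok = parser.can_fetch("OtherBot", "/index.html")

print(googlebot_ok)  # True
print(other_ok)      # False
```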

Prevent folders from being crawled and indexed

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/

Some BadBots Exclusions

# Some bots are known to be trouble, particularly those designed to copy
# entire sites. Extracted from Wikipedia's robots.txt

User-agent: sitecheck.internetseer.com
Disallow: /

User-agent: Zealbot
Disallow: /

User-agent: MSIECrawler
Disallow: /

User-agent: SiteSnagger
Disallow: /

User-agent: WebStripper
Disallow: /

User-agent: WebCopier
Disallow: /

User-agent: Fetch
Disallow: /

User-agent: Offline Explorer
Disallow: /

User-agent: Teleport
Disallow: /

User-agent: TeleportPro
Disallow: /

User-agent: WebZIP
Disallow: /

User-agent: linko
Disallow: /

User-agent: HTTrack
Disallow: /

User-agent: Microsoft.URL.Control
Disallow: /

User-agent: Xenu
Disallow: /

User-agent: larbin
Disallow: /

User-agent: libwww
Disallow: /

User-agent: ZyBORG
Disallow: /

User-agent: Download Ninja
Disallow: /
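Since every record in a list like this has the same shape, it can be generated from a plain list of names. The sketch below builds the exclusion text for three of the agents above and then checks it with Python's urllib.robotparser; the function name build_exclusions is only an illustration.

```python
from urllib.robotparser import RobotFileParser

# A few names taken from the list above.
BAD_BOTS = ["WebStripper", "WebCopier", "HTTrack"]


def build_exclusions(bot_names):
    """Return robots.txt text that shuts each named bot out of the whole site."""
    records = [f"User-agent: {name}\nDisallow: /" for name in bot_names]
    return "\n\n".join(records) + "\n"


robots_txt = build_exclusions(BAD_BOTS)
print(robots_txt)

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print(parser.can_fetch("HTTrack", "/"))    # False
print(parser.can_fetch("Googlebot", "/"))  # True, no record applies to it
```

Remember that this only stops bots that choose to obey robots.txt; the ones that ignore it still have to be blocked on the server.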


LaBitacora General Report October 10, 2008

Submitted by masoha on Sun, 10/12/2008 - 23:10.

General Report

October 10, 2008

 

Even though I haven't worked on La Bitacora project or its attached domains with due diligence (at least for the past four weeks), things are changing at a good pace.


Validators and Tools for Markup Languages

Submitted by masoha on Sun, 09/28/2008 - 15:41.

HTML, XML and CSS Tools

Link Checking

Submitted by masoha on Sun, 09/28/2008 - 14:39.

Link Checking Tools

SEO Tools

Submitted by masoha on Sat, 09/27/2008 - 16:57.

Search Engine Optimization Tools

To better understand the importance of the list of tools presented below, we need to start with a definition of SEO, or Search Engine Optimization.

Quanta Plus: A Powerful Free Editing Tool

Submitted by masoha on Sat, 09/27/2008 - 16:30.

Quanta Plus - A Development Environment for KDE

KompoZer: Free WYSIWYG Editor

Submitted by masoha on Sat, 09/27/2008 - 15:47.

KompoZer: Web Authoring System

Bluefish: Free Linux Editor

Submitted by masoha on Fri, 09/26/2008 - 20:31.

Bluefish: Open Source Editor

Arachnophilia: An Advanced Text Editor

Submitted by masoha on Fri, 09/26/2008 - 14:26.

Arachnophilia - Advanced Text Editor

You are free to use content from this site as long as you give us credit for it