La Bitacora New Challenge
New Year project changes
December 21, 2008
It has been two months since my last performance report. Due to several external factors, the La Bitacora project and all attached domains are almost the same as they were on October 10, 2008 (the last report), when I updated everything; no new content or SEO processes have been applied since then.
- masoha's blog
Robots.txt - Search Engine Optimization - First Step
Robots.txt - First SEO Step
October 12, 2008
So, what is all that buzz about "robots.txt"?
Let's start with the fact that robots.txt is a special file that handles a website's first interaction with robots and spiders. It instructs non-human visitors on how to access the site and which files or folders are not available for indexing, and it also determines whether a given robot is welcome or not.
Even though robots.txt is a standard, it is not owned by any person, association or standards body, so there is no guarantee that all robots, bots, spiders and other automated agents will follow its rules. Nonetheless, it is common practice for robot authors to make their robots obey at least the Standard for Robot Exclusion.
In general, a "good robot" will follow the exclusion rules, while a "bad bot" will ignore them or follow them only partially.
The real importance of robots.txt is that it helps good robots index the pages that you want indexed while keeping them away from the pages that you want to keep private.
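To make the idea of a "good robot" concrete, here is a minimal sketch of the check a polite crawler could run before requesting a page. It uses Python's standard urllib.robotparser module; the rules text, the ExampleBot name and the page URLs are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; a real crawler would download
# them from the site's /robots.txt before requesting anything else.
RULES = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(rules_text: str, user_agent: str, url: str) -> bool:
    """Return True if the rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for page in ("/index.html", "/private/notes.html"):
        url = "http://labitacora.org" + page
        print(page, is_allowed(RULES, "ExampleBot", url))
```

A bad bot simply skips this check, which is why robots.txt on its own cannot enforce anything.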
Security Issues
A security issue related to robots.txt is that by marking a folder for exclusion in your robots.txt, you are actually telling every reader of the file that the folder is there, so human visitors and bad bots will be tempted to visit your private files by guessing the URLs. So NEVER try to hide information available on your website using the robots.txt file. Instead, protect your private files and directories with passwords to get better control over who accesses them or, better yet, configure your server to require user authentication before the private information is displayed.
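The following sketch shows how little effort it takes to turn a robots.txt file into a list of "interesting" folders. The function name and the sample rules are hypothetical; the point is only that anyone who can read the file can do the same.

```python
def disallowed_paths(rules_text: str) -> list[str]:
    """Collect every non-empty Disallow path, as any visitor could."""
    paths = []
    for raw_line in rules_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


if __name__ == "__main__":
    # Sample rules made up for this example.
    sample = "User-agent: *\nDisallow: /private/  # secret\nDisallow: /tmp/\n"
    print(disallowed_paths(sample))
```

Every path printed here is a folder the site owner has publicly announced, so it still needs real access control.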
Google and robots.txt
Google's Webmaster Guidelines suggest the use of robots.txt on several pages. The following paragraph was extracted from the Google Guidelines pages:
"...Make use of the robots.txt file on your web server. This file tells crawlers which directories can or cannot be crawled. Make sure it's current for your site so that you don't accidentally block the Googlebot crawler..."
Since Google's policy on duplicate content within a site is very strict, they recommend configuring robots.txt to prevent Googlebot from crawling and indexing those duplicated pages.
Good Bots - BadBots
While good bots visit your site to discover new information and pages to feed search engine results, bad bots not only consume bandwidth but also steal website content, harvest email addresses and (worst of all) search for website security weaknesses.
A crawler that follows robots.txt directives is considered a good bot. However, this is not always true, because many bad bots fake good behavior, so a webmaster should combine robots.txt directives with the .htaccess file and traffic logs to control bad crawlers and keep them away.
I'm building a list of bad bots that crawl the domains in our project. I will add crawlers that hit our other domains and, to make it more useful, I will also add known bad bots even though they haven't visited our sites yet. This list will be available in a couple of days; check our blog for updates.
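As a rough sketch of how traffic logs can be combined with robots.txt, the function below lists the user agents that requested a path the rules forbid. It assumes the log uses the Apache "combined" format; the function name, the regular expression and the sample data are made up for illustration. Note that it also flags ordinary browsers that open a disallowed page, so the result is a list of candidates to review, not a verdict.

```python
import re
from urllib.robotparser import RobotFileParser

# Matches the request, status, size, referer and user-agent fields of an
# Apache "combined" log line (an assumption about the log format).
LOG_PATTERN = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def agents_ignoring_rules(rules_text, log_lines):
    """Return the set of user agents that requested a disallowed path."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    offenders = set()
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if match is None:
            continue  # skip lines in an unexpected format
        agent, path = match["agent"], match["path"]
        if not parser.can_fetch(agent, path):
            offenders.add(agent)
    return offenders
```

The agents reported this way are the ones worth blocking at the server level, for example in .htaccess, since they have already shown that they ignore the file.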
Robots Exclusion Standard
- Name: the file must be named robots.txt, all letters lower case
- Location: robots.txt MUST BE in the site root; in any other location it will be ignored by crawlers. E.g., http://labitacora.org/robots.txt is the right URL while http://labitacora.org/site/robots.txt is not.
- Format: robots.txt MUST BE created and edited using a plain text editor; rich text formats are not allowed
Robots.txt directives are simple; the first standard has only two directives:
- User-agent: | Format: User-agent:<space><Agent Name or wildcard(*)>. Examples: User-agent: GoogleBot | User-agent: *
- Disallow: | Format: Disallow:<space><path/file, or / for the whole site>. Examples: Disallow: /private/ | Disallow: /directory/file.htm
The # symbol marks a comment. It can be used at the beginning of a line or after a directive; all characters after the # symbol are ignored.
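The two directives and the comment rule are simple enough to parse by hand. The sketch below is a simplified reader, not a complete implementation of the standard: it groups Disallow paths under the User-agent lines that precede them. The function name is hypothetical.

```python
def parse_records(rules_text: str) -> dict[str, list[str]]:
    """Map each User-agent name to the Disallow paths listed under it."""
    records: dict[str, list[str]] = {}
    current_agents: list[str] = []
    in_rules = False  # True once the current record has a Disallow line
    for raw_line in rules_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # text after '#' is a comment
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_rules:  # a User-agent after rules starts a new record
                current_agents, in_rules = [], False
            current_agents.append(value)
            records.setdefault(value, [])
        elif field == "disallow":
            in_rules = True
            for agent in current_agents:
                records[agent].append(value)
    return records
```

An empty string in a list stands for an empty Disallow: line, which means nothing is blocked for that agent.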
Non Standard Directives
There are some directives that are supported only by the big search engines and crawlers:
- Crawl-delay: Indicates how many seconds the crawler must wait between requests. Example: Crawl-delay: 10
- Allow: <path>. This directive helps to fine-tune the Disallow directive. In most cases it is used to let crawlers access specific files or subdirectories inside a path that is blocked by the Disallow directive.
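Here is a small sketch of how a crawler written in Python might read these two directives with the standard urllib.robotparser module. The rules and the file names are made up. One assumption to be aware of: this particular parser applies the first rule that matches, so the Allow line is placed before the Disallow line; other crawlers may resolve conflicts differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one file is opened up inside a blocked folder.
RULES = """\
User-agent: *
Crawl-delay: 10
Allow: /private/public-report.html
Disallow: /private/
"""


def load_rules(rules_text: str) -> RobotFileParser:
    """Build a parser from robots.txt text held in memory."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser


if __name__ == "__main__":
    parser = load_rules(RULES)
    print("delay:", parser.crawl_delay("ExampleBot"))
    print(parser.can_fetch("ExampleBot", "/private/public-report.html"))
    print(parser.can_fetch("ExampleBot", "/private/other.html"))
```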
Extended Standard
A new standard has been proposed. It includes several new directives, some of which are already supported by some crawlers:
- Request-rate: | Format: Request-rate:<space><pages>/<seconds>. It indicates how many pages can be crawled in a time interval given in seconds | Example: Request-rate: 2/10, two pages can be crawled every ten seconds.
- Visit-time: | Format: Visit-time:<space><start>-<finish> | Example: Visit-time: 0700-0900, visit only from 7:00 AM until 9:00 AM. It is very useful for balancing traffic on popular websites.
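A crawler could honor these two directives roughly as sketched below. Python's urllib.robotparser understands Request-rate but not Visit-time, so the visit window is checked by a small hypothetical helper that assumes the window does not cross midnight. All names and sample values are assumptions for illustration.

```python
from datetime import time
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Request-rate: 2/10
Visit-time: 0700-0900
"""


def seconds_between_requests(rules_text, user_agent):
    """Turn Request-rate into a pause length, or None if it is absent."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    rate = parser.request_rate(user_agent)
    if rate is None:
        return None
    return rate.seconds / rate.requests


def parse_hhmm(value):
    """Convert a value such as '0700' into a datetime.time."""
    value = value.strip()
    return time(int(value[:-2]), int(value[-2:]))


def in_visit_window(window, now):
    """Check whether 'now' falls inside a '<start>-<finish>' window."""
    start, _, finish = window.partition("-")
    return parse_hhmm(start) <= now <= parse_hhmm(finish)
```

With Request-rate: 2/10 the pause works out to five seconds between requests.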
Examples:
All robots allowed
User-agent: *
Disallow:
All robots are out, no crawlers allowed
User-agent: *
Disallow: /
Only Googlebot is allowed
User-agent: GoogleBot
Disallow:
User-agent: *
Disallow: /
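Examples like these are easy to get wrong, so it can be worth testing them before uploading the file. The sketch below feeds an "only Googlebot" rule set to Python's urllib.robotparser and reports which agents may fetch the home page. The ExampleBot name and the helper function are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# An empty Disallow lets the named agent in; "Disallow: /" shuts out the rest.
ONLY_GOOGLEBOT = """\
User-agent: GoogleBot
Disallow:

User-agent: *
Disallow: /
"""


def allowed_agents(rules_text, agents, url="/"):
    """Return the agents from the list that may fetch url."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return [agent for agent in agents if parser.can_fetch(agent, url)]


if __name__ == "__main__":
    print(allowed_agents(ONLY_GOOGLEBOT, ["GoogleBot", "ExampleBot"]))
```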
Prevent folders from being crawled and indexed
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
Some BadBots Exclusions
# Some bots are known to be trouble, particularly those designed to copy
# entire sites. Extracted from Wikipedia's robots.txt
User-agent: sitecheck.internetseer.com
Disallow: /
User-agent: Zealbot
Disallow: /
User-agent: MSIECrawler
Disallow: /
User-agent: SiteSnagger
Disallow: /
User-agent: WebStripper
Disallow: /
User-agent: WebCopier
Disallow: /
User-agent: Fetch
Disallow: /
User-agent: Offline Explorer
Disallow: /
User-agent: Teleport
Disallow: /
User-agent: TeleportPro
Disallow: /
User-agent: WebZIP
Disallow: /
User-agent: linko
Disallow: /
User-agent: HTTrack
Disallow: /
User-agent: Microsoft.URL.Control
Disallow: /
User-agent: Xenu
Disallow: /
User-agent: larbin
Disallow: /
User-agent: libwww
Disallow: /
User-agent: ZyBORG
Disallow: /
User-agent: Download Ninja
Disallow: /
LaBitacora General Report October 10, 2008
General Report
October 10, 2008
Even though I haven't worked on the La Bitacora project or its attached domains with due diligence (at least for the past four weeks), things are changing at a good pace.
SEO Tools
Search engine optimization tools
To better understand the importance of the list of tools presented below, it is necessary to start with a definition of SEO, or Search Engine Optimization



