Crawling control

SeznamBot fully complies with the robots exclusion standard (robots.txt for short), which defines rules for robot behavior through a robots.txt file. A robots.txt file contains instructions specifying which content of the website robots are and are not allowed to access and download. All robots that follow this standard read this file first when visiting your website and adjust their behavior according to the directives it contains. A detailed description of its syntax can be found on the official website of the standard.

Using the robots.txt standard you can stop all crawling performed by SeznamBot on your website, or block the downloading of specific pages only. It typically takes several days for our crawler to recheck the restrictions in the robots.txt file and update the index accordingly; for sites that are not visited often, it can take up to several weeks. If you only want to stop a page from being indexed but still allow SeznamBot to download and explore it, see Indexing Control. In that case you should also allow SeznamBot to download the page in the robots.txt file so that it can read the restrictions in the HTML code.

Tip

If you want to keep SeznamBot from accessing your site altogether, use the following directives in your robots.txt file:

User-agent: SeznamBot
Disallow: /

Nonstandard Extensions of robots.txt Syntax Recognized By SeznamBot

In addition to the official version 1.0 of the standard, SeznamBot recognizes several other directives. These extensions are described in the sections below.

Allow Directive

The syntax of the Allow directive is the same as that of the standard Disallow directive, except for the name. The directive explicitly allows robots to access the given URL(s). This is useful when you want robots to avoid an entire directory but still want some HTML documents in that directory crawled and indexed (see the last example below).

Examples

User-agent: *
Disallow:
All robots can access and download all pages of the website. An empty value following the Disallow directive means that the directive does not apply at all. This is the default behavior (an empty or nonexistent robots.txt file has the same effect). An empty Allow directive, as in the next example, has the same meaning.
User-agent: *
Allow:
User-agent: *
Disallow: /
No robot can download any page.
User-agent: *
Disallow: /archive/
Disallow: /abc
No robot can enter the /archive/ directory of the website. Furthermore, no robot can download any page whose address begins with “/abc”.
User-agent: *
Disallow: /
Allow: /A/
Disallow: /A/B/
All robots can download files only from the /A/ directory and its subdirectories, except for the /A/B/ subdirectory. The order of the directives is not important.
User-agent: SeznamBot
Disallow: /
SeznamBot can’t download anything from the website. Other robots are allowed by default.
User-agent: SeznamBot
Disallow: /discussion/
SeznamBot can’t download the /discussion/ directory. Other robots are allowed by default.
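For instance, you can disallow an entire directory and still explicitly allow a single document inside it; the /archive/ path and the file name below are only illustrative:
User-agent: SeznamBot
Disallow: /archive/
Allow: /archive/index.html
SeznamBot can’t download anything from the /archive/ directory except the /archive/index.html document. Other robots are allowed by default.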

Wildcards

You can use the following wildcards in a robots.txt file:

* matches any number of any characters (an arbitrary string) and can be used multiple times in a directive.
$ matches the end of the address string.

Examples

User-agent: SeznamBot
Disallow: *.pdf$
Disallows downloading of all files whose addresses end with “.pdf” (regardless of the preceding characters).
User-agent: SeznamBot
Disallow: /*/discussion/$
Disallows downloading the default document of any /discussion/ subdirectory, while still allowing all other files in those subdirectories to be downloaded.
User-agent: SeznamBot
Disallow: /discussion$
Disallows /discussion, while allowing /discussion-01, /discussion-02, etc.
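Another common, purely illustrative use of the * wildcard is to block URLs containing a particular query parameter (the sessionid name below is hypothetical):
User-agent: SeznamBot
Disallow: /*?sessionid=
Disallows downloading of any URL that contains “?sessionid=” anywhere after the initial slash.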

Request-rate Directive

The Request-rate directive tells robots how many documents they can download from a website during a given time period. SeznamBot fully respects this directive, which lets you set the download rate so that your servers are not overloaded or even crashed. Conversely, if you want your files to be processed by SeznamBot at a faster rate, you can set the Request-rate to a higher value.

The Request-rate directive syntax is: Request-rate: <number of documents>/<time>

You can also specify a time period during the day in which the robot will observe the rate set by the directive. For the rest of the day, it will return to its regular behavior.

The general syntax in this case is: Request-rate: <rate> <time of day>

Examples

Request-rate: 1/10s
Robots are allowed to download one document every ten seconds.
Request-rate: 100/15m
100 documents every 15 minutes.
Request-rate: 400/1h
400 documents every hour.
Request-rate: 9000/1d
9000 documents every day.
Request-rate: 1/10s 1800-1900
Robots are allowed to download one document every ten seconds between 18:00 and 19:00 (UTC). At other times there is no limit on the download rate.

Caution

The minimum download rate for SeznamBot is 1 document every 10 seconds. If you specify a lower value, SeznamBot will interpret it as this minimum rate. The maximum rate is only limited by the current speed of SeznamBot.
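For instance (the value is only illustrative):
Request-rate: 1/60s
This asks for one document per minute, which is slower than the minimum, so SeznamBot would treat it as one document every ten seconds.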

User-agent: * 
Disallow: /images/ 
Request-rate: 30/1m 
# all robots except for SeznamBot and Googlebot: do not access /images/ directory,
# rate 30 URLs per minute

User-agent: SeznamBot 
Disallow: /cz/chat/ 
Request-rate: 300/1m 
# SeznamBot: do not access /cz/chat/ directory, rate 300 URLs per minute 

User-agent: Googlebot 
Disallow: /logs/ 
Request-rate: 10/1m 
# Googlebot: do not access /logs/, rate 10 URLs per minute

User-agent: *
Disallow: / 
# all robots except for SeznamBot: do not access anything 

User-agent: SeznamBot
Request-rate: 300/1m
# SeznamBot: access everything, rate 300 URLs per minute
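If you want to throttle crawling only during part of the day, the time-restricted form shown earlier can be combined with a User-agent group in the same way; the times and rate below are only illustrative:

User-agent: SeznamBot
Request-rate: 1/10s 0600-1800
# SeznamBot: at most one URL every ten seconds between 06:00 and 18:00 (UTC);
# no rate limit outside this period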

Sitemaps

The Sitemap directive tells robots where to find the sitemap of the website. Its syntax is: Sitemap: <absolute URL>
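For example, a robots.txt file can point robots to the website’s sitemap like this (the URL is only a placeholder):
Sitemap: https://www.example.com/sitemap.xml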