Crawling control

SeznamBot fully complies with the robots exclusion standard (robots.txt for short), which defines rules for robot behavior through a robots.txt file. A robots.txt file contains instructions specifying which content of the website robots are and are not allowed to access and download. All robots that follow this standard read this file first when visiting your website and adjust their behavior according to the directives it contains. A detailed description of its syntax can be found on the official website of the standard.

Using the robots.txt standard you can stop all crawling performed by SeznamBot on your website, or block the downloading of specific pages only. It typically takes several days for our crawler to recheck the restrictions in the robots.txt file and update the index accordingly; for sites that are not visited often, it can take up to several weeks. If you only want to stop a page from being indexed but still allow SeznamBot to download and explore it, see Indexing Control. In that case you should also allow SeznamBot to download the page in the robots.txt file so that it can read the restrictions in the HTML code.

Tip

If you want to keep SeznamBot from accessing your site altogether, use the following directives in your robots.txt file:

User-agent: SeznamBot
Disallow: /

Nonstandard Extensions of robots.txt Syntax Recognized By SeznamBot

In addition to the official version 1.0 of the standard, SeznamBot recognizes several other directives. These extensions are described in the sections below.

Allow Directive

The syntax of the Allow directive is the same as that of the standard Disallow directive, except for the name. The directive explicitly allows robots to access the given URL(s). This is useful when you want robots to avoid an entire directory but still want some HTML documents in that directory crawled and indexed (see the last example below).

Examples

User-agent: *
Disallow:
All robots can access and download all pages of the website. An empty value following the Disallow directive means that the directive does not apply at all. This is the default behavior (an empty or nonexistent robots.txt file has the same effect). An empty Allow directive, as in the next example, has the same meaning.
User-agent: *
Allow:
User-agent: *
Disallow: /
No robot can download any page.
User-agent: *
Disallow: /archive/
Disallow: /abc
No robot can enter the /archive/ directory of the website. Furthermore, no robot can download any page whose address begins with “/abc”.
User-agent: *
Disallow: /
Allow: /A/
Disallow: /A/B/
All robots can download files only from the /A/ directory and its subdirectories, except for the /A/B/ subdirectory. The order of the directives is not important.
User-agent: SeznamBot
Disallow: /
SeznamBot can’t download anything from the website. Other robots are allowed by default.
User-agent: SeznamBot
Disallow: /discussion/
SeznamBot can’t download the /discussion/ directory. Other robots are allowed by default.
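For instance, you can disallow an entire directory and still explicitly allow a single document inside it; the /archive/ path and the file name below are only illustrative:
User-agent: SeznamBot
Disallow: /archive/
Allow: /archive/index.html
SeznamBot can’t download anything from the /archive/ directory except the /archive/index.html document. Other robots are allowed by default.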

Wildcards

You can use the following wildcards in a robots.txt file:

* matches any number of any characters (an arbitrary string) and can be used multiple times in a directive.
$ matches the end of the address string.

Examples

User-agent: SeznamBot
Disallow: *.pdf$
Disallows downloading of all files whose addresses end with “.pdf” (regardless of the preceding characters).
User-agent: SeznamBot
Disallow: /*/discussion/$
Disallows downloading the default document of any /discussion/ subdirectory, while still allowing all other files in those subdirectories to be downloaded.
User-agent: SeznamBot
Disallow: /discussion$
Disallows /discussion, while allowing /discussion-01, /discussion-02, etc.
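Another common, purely illustrative use of the * wildcard is to block URLs containing a particular query parameter (the sessionid name below is hypothetical):
User-agent: SeznamBot
Disallow: /*?sessionid=
Disallows downloading of any URL that contains “?sessionid=” anywhere after the initial slash.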

Request-rate Directive

The Request-rate directive tells robots how many documents they can download from a website during a given time period. SeznamBot fully respects this directive, which lets you set the download rate so that your servers are not overloaded or even crashed. Conversely, if you want your files to be processed by SeznamBot at a faster rate, you can set the Request-rate to a higher value.

The Request-rate directive syntax is: Request-rate: <number of documents>/<time>

You can also specify a time period during the day in which the robot will observe the rate set by the directive. For the rest of the day, it will return to its regular behavior.

The general syntax in this case is: Request-rate: <rate> <time of day>

Examples

Request-rate: 1/10s
Robots are allowed to download one document every ten seconds.
Request-rate: 100/15m
100 documents every 15 minutes.
Request-rate: 400/1h
400 documents every hour.
Request-rate: 9000/1d
9000 documents every day.
Request-rate: 1/10s 1800-1900
Robots are allowed to download one document every ten seconds between 18:00 and 19:00 (UTC). At other times there is no limit on the download rate.

Caution

The minimum download rate for SeznamBot is 1 document every 10 seconds. If you specify a lower value, SeznamBot will interpret it as this minimum rate. The maximum rate is only limited by the current speed of SeznamBot.
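For instance (the value is only illustrative):
Request-rate: 1/60s
This asks for one document per minute, which is slower than the minimum, so SeznamBot would treat it as one document every ten seconds.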

User-agent: * 
Disallow: /images/ 
Request-rate: 30/1m 
# all robots except for SeznamBot and Googlebot: do not access /images/ directory,
# rate 30 URLs per minute

User-agent: SeznamBot 
Disallow: /cz/chat/ 
Request-rate: 300/1m 
# SeznamBot: do not access /cz/chat/ directory, rate 300 URLs per minute 

User-agent: Googlebot 
Disallow: /logs/ 
Request-rate: 10/1m 
# Googlebot: do not access /logs/, rate 10 URLs per minute

User-agent: *
Disallow: / 
# all robots except for SeznamBot: do not access anything 

User-agent: SeznamBot
Request-rate: 300/1m
# SeznamBot: access everything, rate 300 URLs per minute
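If you want to throttle crawling only during part of the day, the time-restricted form shown earlier can be combined with a User-agent group in the same way; the times and rate below are only illustrative:

User-agent: SeznamBot
Request-rate: 1/10s 0600-1800
# SeznamBot: at most one URL every ten seconds between 06:00 and 18:00 (UTC);
# no rate limit outside this period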

Sitemaps

The Sitemap directive tells robots where to find the sitemap of the website. Its syntax is: Sitemap: <absolute URL>
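For example, a robots.txt file can point robots to the website’s sitemap like this (the URL is only a placeholder):
Sitemap: https://www.example.com/sitemap.xml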