{"id":1709,"date":"2024-10-23T17:32:25","date_gmt":"2024-10-23T15:32:25","guid":{"rendered":"https:\/\/o-seznam.cz\/napoveda\/vyhledavani\/?page_id=1709"},"modified":"2025-01-29T12:45:37","modified_gmt":"2025-01-29T11:45:37","slug":"crawling-control","status":"publish","type":"page","link":"https:\/\/o-seznam.cz\/napoveda\/vyhledavani\/en\/crawling-control\/","title":{"rendered":"Crawling control"},"content":{"rendered":"\n<p>SeznamBot fully complies with the <a class=\"is-style-link-external\" href=\"https:\/\/en.wikipedia.org\/wiki\/Robots-exclusion-standard\">robots exclusion standard<\/a> (or simply <strong>robots.txt<\/strong>), which specifies the rules of robot behavior through a <code>robots.txt<\/code> file. A <code>robots.txt<\/code> file contains instructions that specify which content of the website robots are and are not allowed to access and download. All robots that follow this standard read this file first when visiting your website and adjust their behavior according to the directives in the file. You can find a detailed description of its syntax on the <a class=\"is-style-link-external\" href=\"http:\/\/www.robotstxt.org\/orig.html\">official website<\/a> of the standard.<\/p>\n\n\n\n<p>Using the robots.txt standard, you can stop all crawling performed by SeznamBot on your website, or stop the downloading of specific pages only. It typically takes several days for our crawler to recheck the restrictions in the <code>robots.txt<\/code> file and update the index accordingly, though for sites that are not visited often, it can take up to several weeks. If you only want to stop <em>indexing<\/em> of a page but still allow SeznamBot to download and explore it, see <a href=\"..\/indexing-control\/\">Indexing Control<\/a>. 
In that case, you should also allow SeznamBot to download the page in the <code>robots.txt<\/code> file so that it can read the restrictions in the HTML code.<\/p>\n\n\n\n<div class=\"wp-block-seznam-box is-layout-flow wp-block-seznam-box-is-layout-flow\">\n<h2 class=\"wp-block-heading\">Tip<\/h2>\n\n\n\n<p>If you want to keep SeznamBot from accessing your site altogether, use the following directives in your <code>robots.txt<\/code> file:<\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-28f84493 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p><code>User-agent: SeznamBot<br>Disallow: \/<\/code><\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\"><\/div>\n<\/div>\n<\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"syntax-robots.txt\">Nonstandard Extensions of robots.txt Syntax Recognized by SeznamBot<\/h2>\n\n\n\n<p>In addition to the <a class=\"is-style-link-external\" href=\"http:\/\/www.robotstxt.org\/orig.html\">official version 1.0 standard<\/a>, SeznamBot recognizes several other directives. These extensions are described below in separate sections.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Allow Directive<\/h2>\n\n\n\n<p>The syntax of the Allow directive is the same as that of the standard Disallow directive, except for the name. The directive explicitly allows robots to access the given URL(s). This is useful when you want to instruct robots to avoid an entire directory but still want some HTML documents in that directory crawled and indexed.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Examples<\/h4>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><code>User-agent: *<br>Disallow:<\/code><\/td><td>All robots can access and download all pages of the website. The empty value following the <code>Disallow<\/code>\/<code>Allow<\/code> directive means that the directive doesn&#8217;t apply at all. 
This is the default behavior (an empty or nonexistent <code>robots.txt<\/code> file has the same effect).<\/td><\/tr><tr><td><code>User-agent: *<br>Allow:<\/code><\/td><td>Same as above: an empty <code>Allow<\/code> directive doesn&#8217;t apply either, so all robots can access and download all pages.<\/td><\/tr><tr><td><code>User-agent: *<br>Disallow: \/ <\/code><\/td><td>No robot can download any page.<\/td><\/tr><tr><td><code>User-agent: *<br>Disallow: \/archive\/<br>Disallow: \/abc<\/code><\/td><td>No robot can enter the <em>\/archive\/<\/em> directory of the website. Furthermore, no robot can download any page with a name starting with &#8220;<em>abc<\/em>&#8221;.<\/td><\/tr><tr><td><code>User-agent: *<br>Disallow: \/<br>Allow: \/A\/<br>Disallow: \/A\/B\/<\/code><\/td><td>All robots can download files only from the <em>\/A\/<\/em> directory and its subdirectories, except for the subdirectory <em>\/A\/B\/<\/em>. The order of the directives is not important.<\/td><\/tr><tr><td><code>User-agent: SeznamBot<br>Disallow: \/ <\/code><\/td><td>SeznamBot can&#8217;t download anything from the website. Other robots are allowed by default.<\/td><\/tr><tr><td><code>User-agent: SeznamBot<br>Disallow: \/discussion\/ <\/code><\/td><td>SeznamBot can&#8217;t download anything from the <em>\/discussion\/<\/em> directory. Other robots are allowed by default.<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Wildcards<\/h2>\n\n\n\n<p>You can use the following wildcards in a <code>robots.txt<\/code> file:<\/p>\n\n\n\n<figure class=\"wp-block-table fulltxt mceItemTable\"><table><tbody><tr><td>*<\/td><td>any number of any characters (an arbitrary string). 
Can be used multiple times in a directive.<\/td><\/tr><tr><td>$<\/td><td>the end of the address string<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">Examples<\/h4>\n\n\n\n<figure class=\"wp-block-table fulltxt mceItemTable\"><table><tbody><tr><td><code>User-agent: SeznamBot<br>Disallow: *.pdf$<\/code><\/td><td>Disallow downloading all files with addresses ending in \u201c<em>.pdf<\/em>\u201d (regardless of the characters preceding it).<\/td><\/tr><tr><td><code>User-agent: SeznamBot<br>Disallow: \/*\/discussion\/$<\/code><\/td><td>Disallow downloading the default document in any of the <em>\/discussion\/<\/em> subdirectories while still allowing all other files in those subdirectories to be downloaded.<\/td><\/tr><tr><td><code>User-agent: SeznamBot<br>Disallow: \/discussion$<\/code><\/td><td>Disallow <em>\/discussion<\/em>, while allowing <em>\/discussion-01<\/em>, <em>\/discussion\/-02<\/em> etc.<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"request-rate-directive\">Request-rate Directive<\/h2>\n\n\n\n<p>The Request-rate directive is used to tell robots how many documents they can download from a website during a given time period. SeznamBot fully respects this directive, which enables you to set the download rate in a way that prevents your servers from being overloaded or even crashed. On the other hand, if you want your files to be processed by SeznamBot at a faster rate, you can set the Request-rate to a higher value.<\/p>\n\n\n\n<p>The Request-rate directive syntax is: <code>Request-rate: &lt;number of documents&gt;\/&lt;time&gt;<\/code><\/p>\n\n\n\n<p>You can also specify a time of day during which the robot will observe the rate set by the directive. 
For the rest of the day, it will return to its regular behavior.<\/p>\n\n\n\n<p>The general syntax in this case is: <code>Request-rate: &lt;rate&gt; &lt;time of day&gt;<\/code><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Examples<\/h4>\n\n\n\n<figure class=\"wp-block-table fulltxt mceItemTable\"><table><tbody><tr><td><code>Request-rate: 1\/10<strong>s<\/strong><\/code><\/td><td>Robots are allowed to download one document every ten seconds.<\/td><\/tr><tr><td><code>Request-rate: 100\/15<strong>m<\/strong><\/code><\/td><td>100 documents every 15 minutes<\/td><\/tr><tr><td><code>Request-rate: 400\/1<strong>h<\/strong><\/code><\/td><td>400 documents every hour<\/td><\/tr><tr><td><code>Request-rate: 9000\/1<strong>d<\/strong><\/code><\/td><td>9000 documents every day<\/td><\/tr><tr><td><code>Request-rate: 1\/10<strong>s<\/strong> 1800-1900<\/code><\/td><td>Robots are allowed to download one document every ten seconds between 18:00 and 19:00 (UTC). At other times, there is no limit on the download rate.<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<div class=\"wp-block-seznam-box is-layout-flow wp-block-seznam-box-is-layout-flow\">\n<h2 class=\"wp-block-heading\">Caution<\/h2>\n\n\n\n<p>The minimum download rate for SeznamBot is 1 document every 10 seconds. If you specify a lower value, SeznamBot will interpret it as this minimum rate. 
The maximum rate is limited only by the current speed of SeznamBot.<\/p>\n<\/div>\n\n\n\n<pre title=\"Examples (specific and all other robots)\" class=\"wp-block-code\"><code lang=\"adoc\" class=\"language-adoc line-numbers\">User-agent: * \nDisallow: \/images\/ \nRequest-rate: 30\/1m \n# all robots except for SeznamBot and Googlebot: do not access \/images\/ directory,\n# rate 30 URLs per minute\n\nUser-agent: SeznamBot \nDisallow: \/cz\/chat\/ \nRequest-rate: 300\/1m \n# SeznamBot: do not access \/cz\/chat\/ directory, rate 300 URLs per minute \n\nUser-agent: Googlebot \nDisallow: \/logs\/ \nRequest-rate: 10\/1m \n# Googlebot: do not access \/logs\/, rate 10 URLs per minute<\/code><\/pre>\n\n\n\n<pre title=\"Examples (SeznamBot and all other robots)\" class=\"wp-block-code\"><code lang=\"adoc\" class=\"language-adoc line-numbers\">User-agent: * \nDisallow: \/ \n# all robots except for SeznamBot: do not access anything \n\nUser-agent: SeznamBot \nRequest-rate: 300\/1m \n# SeznamBot: access everything, rate 300 URLs per minute<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Sitemaps<\/h2>\n\n\n\n<p>Sitemaps allow you to fine-tune the movement of SeznamBot around your website. Through a sitemap you can tell SeznamBot which pages change frequently, when a given page was last updated, or what its indexing priority within the site is. Sitemaps are implemented through the Sitemap protocol, which uses XML files that contain all the information needed. You can find more information on sitemaps, including the exact syntax, on the official website <a class=\"is-style-link-external\" href=\"http:\/\/www.sitemaps.org\/protocol.html\">sitemaps.org<\/a>.<\/p>\n\n\n\n<p>The Sitemap directive syntax is: <code>Sitemap: &lt;absolute URL&gt;<\/code><\/p>\n","protected":false},"excerpt":{"rendered":"<p>SeznamBot fully complies with the robots exclusion standard (or simply robots.txt), which specifies the rules of robot behavior through a robots.txt file. 
A robots.txt file contains instructions that specify which content of the website the robots are \/ are not allowed to access and download. All robots visiting your web that follow this standard read [&hellip;]<\/p>\n","protected":false},"author":22,"featured_media":0,"parent":0,"menu_order":2,"comment_status":"closed","ping_status":"closed","template":"page-sidemenu","meta":{"_acf_changed":false,"_relevanssi_hide_post":"","_relevanssi_hide_content":"","_relevanssi_pin_for_all":"","_relevanssi_pin_keywords":"","_relevanssi_unpin_keywords":"","_relevanssi_related_keywords":"","_relevanssi_related_include_ids":"","_relevanssi_related_exclude_ids":"","_relevanssi_related_no_append":"","_relevanssi_related_not_related":"","_relevanssi_related_posts":"","_relevanssi_noindex_reason":"","footnotes":""},"page_category":[28],"page_tag":[],"service_status_category":[],"class_list":["post-1709","page","type-page","status-publish","hentry"],"acf":[],"lang":"en","translations":{"en":1709,"cs":73},"pll_sync_post":[],"_links":{"self":[{"href":"https:\/\/o-seznam.cz\/napoveda\/vyhledavani\/wp-json\/wp\/v2\/pages\/1709","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/o-seznam.cz\/napoveda\/vyhledavani\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/o-seznam.cz\/napoveda\/vyhledavani\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/o-seznam.cz\/napoveda\/vyhledavani\/wp-json\/wp\/v2\/users\/22"}],"replies":[{"embeddable":true,"href":"https:\/\/o-seznam.cz\/napoveda\/vyhledavani\/wp-json\/wp\/v2\/comments?post=1709"}],"version-history":[{"count":4,"href":"https:\/\/o-seznam.cz\/napoveda\/vyhledavani\/wp-json\/wp\/v2\/pages\/1709\/revisions"}],"predecessor-version":[{"id":1992,"href":"https:\/\/o-seznam.cz\/napoveda\/vyhledavani\/wp-json\/wp\/v2\/pages\/1709\/revisions\/1992"}],"wp:attachment":[{"href":"https:\/\/o-seznam.cz\/napoveda\/vyhledavani\/wp-json\/wp\/v2\/media?parent=1709"}],"wp:term":[{"taxonomy":"page_category","embe
ddable":true,"href":"https:\/\/o-seznam.cz\/napoveda\/vyhledavani\/wp-json\/wp\/v2\/page_category?post=1709"},{"taxonomy":"page_tag","embeddable":true,"href":"https:\/\/o-seznam.cz\/napoveda\/vyhledavani\/wp-json\/wp\/v2\/page_tag?post=1709"},{"taxonomy":"service_status_category","embeddable":true,"href":"https:\/\/o-seznam.cz\/napoveda\/vyhledavani\/wp-json\/wp\/v2\/service_status_category?post=1709"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}