Baiduspider Help Center
How to Block Crawling

1. What is robots.txt?

Search engines use spiders to visit sites on the Internet automatically and fetch their contents. Before a spider accesses a site, it checks whether a robots.txt file exists in the root directory of the site. This file defines the crawling scope of the site, so you can create a robots.txt file to tell search engines which contents you do or do not want indexed.
Please note that you need a robots.txt file only if your site includes contents that you do not want search engines to index. If you want search engines to index everything on your site, do not create a robots.txt file.

2. Where do I place my robots.txt file?

The robots.txt file must be placed in the root directory of the site. For example, before a spider accesses the site http://www.baidu-example.com, it first checks whether http://www.baidu-example.com/robots.txt exists. If the file exists, the spider reads its contents and crawls according to the instructions.

Web site URL                 robots.txt URL
http://www.w3.org/           http://www.w3.org/robots.txt
http://www.w3.org:80/        http://www.w3.org:80/robots.txt
http://www.w3.org:1234/      http://www.w3.org:1234/robots.txt
http://w3.org/               http://w3.org/robots.txt
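As the table shows, the robots.txt location depends only on the site origin (scheme, host, and port), never on the path. A minimal sketch of that derivation in Python (the function name `robots_url` is ours, for illustration only):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(site_url):
    """Return the robots.txt URL for a site: keep the scheme, host and
    port, discard any path, query, or fragment."""
    parts = urlsplit(site_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://www.w3.org:1234/some/page.html"))
# http://www.w3.org:1234/robots.txt
```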

3. I have created a robots.txt file to block Baiduspider from indexing my site, but the contents still appear in Baidu search results. Why?

If the pages you have blocked are linked to by other sites, those pages may still appear in Baidu search results. However, Baidu will not crawl, index, or show the content of pages blocked by robots.txt; what appears in the search results is only the description provided by the linking sites.

4. How do I ask search engines to index a page but not follow the links on it?

If you do not want search engines to follow the links on a page or the content those links point to, add the following to the <HEAD> of the page:

<meta name="robots" content="nofollow">

If you only want to stop a specific link from being followed, add the rel="nofollow" attribute to that link:

<a href="signin.php" rel="nofollow">sign in</a>

If you want all search engines except Baiduspider to follow the links on your page, add the following to the <HEAD> of the page:

<meta name="Baiduspider" content="nofollow">

5. How do I ask search engines to index a page but not show a cached copy in the results?

To stop all search engines from showing cached copies of your pages, add the following to the <HEAD> of the pages:

<meta name="robots" content="noarchive">

If you want all search engines except Baiduspider to show cached copies of your pages, add the following to the <HEAD> of the pages:

<meta name="Baiduspider" content="noarchive">

Note: this meta tag only stops Baiduspider from showing a cached copy of the page; Baidu will continue to index the page and show snippets of it in the results.

6. How do I stop Baiduspider from crawling certain images?

To stop Baiduspider from crawling all images, or images of specific file types, you can specify them in robots.txt. For details, please refer to Examples 10-12 in section 9.

7. What is the format of robots.txt file?

The file must contain one or more records, separated by blank lines (lines terminated by CR, CR/LF, or LF). The format of each record is:

"<field>:<optional space><value><optional space>"

Comments can be included in the file using the '#' character, following the same convention as in Unix. A record starts with one or more User-agent lines, followed by one or more "Disallow" or "Allow" lines, as described below:

User-agent:
    The value of this field names the robot (search engine spider) the record applies to. If there are multiple User-agent records, the file constrains multiple robots; every robots.txt file needs at least one. If the value is set to '*', the record applies to every robot, and a robots.txt file may contain only one "User-agent: *" record. If a record "User-agent: SomeBot" exists together with Disallow/Allow lines, then the robot named "SomeBot" follows only the Disallow/Allow lines after "User-agent: SomeBot".

Disallow:
    The value of this field specifies URLs that should not be visited. It can be a full path or a path prefix; any URL that starts with this value will not be retrieved. For example, "Disallow: /help" blocks the robot from visiting "/help.html", "/helpabc.html" and "/help/index.html", whereas "Disallow: /help/" allows the robot to visit "/help.html" and "/helpabc.html" but not "/help/index.html". An empty value ("Disallow:") means every URL of the site may be visited. There must be at least one "Disallow" line in a robots.txt record. If the robots.txt file does not exist or is empty, the entire site is open to all search engines.

Allow:
    The value of this field specifies URLs that are allowed to be visited. It can be a full path or a path prefix; any URL that starts with this value may be retrieved. For example, "Allow: /hibaidu" allows "/hibaidu.htm", "/hibaiducom.html" and "/hibaidu/com.html". By default, all URLs of a site are allowed to be visited, so use Allow together with Disallow to tell search engines which URLs can be accessed and which cannot.

Please note that the order of Disallow and Allow lines matters: the robot checks the lines in sequence from the top of the record and acts on the first one that matches the URL.
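The first-match rule can be illustrated with Python's standard-library urllib.robotparser, which also evaluates rules in order (the file path /tmp/readme.html below is just a placeholder, not from the original examples):

```python
from urllib.robotparser import RobotFileParser

# The Allow line appears before the broader Disallow, so this single
# file stays fetchable while the rest of /tmp/ is blocked.
rules = """\
User-agent: *
Allow: /tmp/readme.html
Disallow: /tmp/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())
print(rp.can_fetch("*", "http://example.com/tmp/readme.html"))  # True
print(rp.can_fetch("*", "http://example.com/tmp/other.html"))   # False
```

If the two rule lines were swapped, "Disallow: /tmp/" would match first and the readme page would be blocked as well.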

"*"and"$":
Baiduspider supports wildcard characters "*"and"$" in matching with url
"$" matches the end-of-line character
"*" matches 0 or multiple arbitrary characters

8. Examples of URL matching

Value of Allow or Disallow    URL            Matched?
/tmp                          /tmp           yes
/tmp                          /tmp.html      yes
/tmp                          /tmp/a.html    yes
/tmp/                         /tmp           no
/tmp/                         /tmphoho       no
/tmp/                         /tmp/a.html    yes

/Hello*                       /Hello.html    yes
/He*lo                        /Hello,lolo    yes
/Heap*lo                      /Hello,lolo    no
html$                         /tmpa.html     yes
/a.html$                      /a.html        yes
htm$                          /a.html        no
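The matching semantics in the table can be sketched as a small Python matcher. This is our illustrative reading of the rules, not Baidu's actual implementation; in particular, treating values that do not start with "/" as floating (an implicit leading "*") is inferred from the "html$" row:

```python
import re

def rule_matches(value, path):
    """Check whether a robots.txt Allow/Disallow value matches a URL path.

    Sketch of the rules described above: the value is a prefix match,
    '*' matches any run of characters, and a trailing '$' anchors the
    match at the end of the path.
    """
    anchored_end = value.endswith("$")
    if anchored_end:
        value = value[:-1]
    # Values not starting with '/' are treated as floating matches
    # (assumption inferred from the table, e.g. 'html$' vs '/tmpa.html').
    if not value.startswith("/"):
        value = "*" + value
    # Escape regex metacharacters, then turn '*' back into '.*'.
    pattern = re.escape(value).replace(r"\*", ".*")
    if anchored_end:
        pattern += "$"
    return re.match(pattern, path) is not None

print(rule_matches("/tmp", "/tmp.html"))   # True
print(rule_matches("htm$", "/a.html"))     # False
```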

9. Examples of robots.txt file

Example 1. To block all search engines from accessing the entire site
User-agent: *
Disallow: /
Example 2. To allow all robots full access (or create an empty "/robots.txt" file)
User-agent: *
Disallow:
or
User-agent: *
Allow: /
Example 3. To block only Baiduspider from accessing your site
User-agent: Baiduspider
Disallow: /
Example 4. To allow only Baiduspider to access your site
User-agent: Baiduspider
Disallow:

User-agent: *
Allow: /
Example 5. To block spiders from accessing particular directories
In this example, access to three directories is restricted: the robot will not visit any of them. Note that each directory needs its own "Disallow" line; a combined line such as "Disallow: /cgi-bin/ /tmp/" will not work.
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
Example 6. To allow access to some URLs in otherwise blocked directories
User-agent: *
Allow: /cgi-bin/see
Allow: /tmp/hi
Allow: /~joe/look
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
Example 7. Using "*" to control access to URLs
To block access to all URLs ending with ".htm" under the /cgi-bin/ directory (including its subdirectories):
User-agent: *
Disallow: /cgi-bin/*.htm
Example 8. Using "$" to control access to URLs
To allow access only to URLs ending with ".htm":
User-agent: *
Allow: .htm$
Disallow: /
Example 9. To block all dynamic pages
User-agent: *
Disallow: /*?*
Example 10. To stop Baiduspider from crawling any images on the site
Pages may be crawled, but images may not.
User-agent: Baiduspider
Disallow: /*.jpg$
Disallow: /*.jpeg$
Disallow: /*.gif$
Disallow: /*.png$
Disallow: /*.bmp$
Example 11. To allow Baiduspider to crawl pages and only ".gif" images
Pages and .gif images may be crawled; images of other file types may not.
User-agent: Baiduspider
Allow: /*.gif$
Disallow: /*.jpg$
Disallow: /*.jpeg$
Disallow: /*.png$
Disallow: /*.bmp$
Example 12. To stop Baiduspider from crawling only ".jpg" images
User-agent: Baiduspider
Disallow: /*.jpg$
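The plain-prefix examples above can be sanity-checked with Python's standard-library urllib.robotparser. Note that this parser implements only the original exclusion protocol and does not understand the "*" and "$" wildcards used in Examples 7-12, so use it only on prefix rules:

```python
from urllib.robotparser import RobotFileParser

# The rules from Example 5: block three directories for every robot.
robots_txt = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("Baiduspider", "http://example.com/index.html"))  # True
print(rp.can_fetch("Baiduspider", "http://example.com/tmp/a.html"))  # False
```

(example.com is a placeholder host; the parser matches only the URL path against the rules.)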

10. Reference of robots.txt file

For details on how to write a robots.txt file, please see the following references:
Web Server Administrator's Guide to the Robots Exclusion Protocol
HTML Author's Guide to the Robots Exclusion Protocol
The original 1994 protocol description, as currently deployed
The revised Internet-Draft specification, which is not yet completed or implemented

©2011 Baidu