
Robots.txt for SEO: The Ultimate Guide

comprehensive site audit. But your site might not need a robots.txt file. Without one, Googlebot will crawl your entire site, which is exactly what you want if you want every page to be eligible for indexing. You only need a robots.txt file if you want more control over what search engines crawl.

Here are the main scenarios in which you will need a robots.txt file:

1. Crawl budget optimization

Each website has a crawl budget. This means that in a given time frame, Google will crawl only a limited number of pages on a site.

If the number of pages on your site exceeds the crawl budget, some pages won’t make it into Google’s index. And when your pages are not in Google’s index, there is very little chance of them ranking in search.

One easy way to optimize this is to make sure that search engine bots don’t crawl low-priority or non-essential content that doesn’t need frequent crawling. This could include duplicate pages, archives, or dynamically generated content that doesn’t significantly impact search rankings. This will save your crawl budget for the pages you do want indexed.
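
For example, a minimal robots.txt sketch for this scenario might look like the following, where /archive/ and /internal-search/ are hypothetical placeholders for the low-priority sections you actually want to exclude:

User-agent: *
Disallow: /archive/
Disallow: /internal-search/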

You can easily monitor nonessential sections of your site by setting up site segment analysis with Similarweb’s Website Segments tool. This will show you whether those pages are getting organic traffic (a strong sign they are being indexed). Simply set up a segment that covers the content in question. You can choose any rule, including:

  • Folders
  • Any variation of text
  • Exact text
  • Exact URLs

Below, we are setting up a segment for the /gp/ subfolder on amazon.com.

Creating a new website segment

Once your segment is set up, go to the Marketing Channels report and look at Organic Traffic. This will quickly show you if this site segment is getting traffic and eating up your crawl budget. Below, you can see that the segment we are tracking is getting 491.6K visits over the period of one year.

Marketing Channels report showing Organic Traffic

2. Avoiding duplicate content issues

For many sites, duplicate content is unavoidable. For instance, an ecommerce site may have several near-identical product pages that could all compete for the same keyword. Robots.txt is an easy way to keep search engines from crawling these duplicates.
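
As a hedged illustration, a sketch like the one below blocks parameter-based duplicates while leaving the canonical product pages crawlable; the ?sort= and ?color= parameters are hypothetical, so adapt them to your own URL structure:

User-agent: *
Disallow: /*?sort=
Disallow: /*?color=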

3. Prioritizing important content

By using the Allow: directive, you can explicitly permit search engines to crawl and index specific high-priority content on your site. This helps ensure that important pages are discovered and indexed.

4. Preventing indexing of admin or test areas

If your site has admin or test areas that should not show up in search, using Disallow: in the robots.txt file keeps search engine bots from crawling them. For anything sensitive, pair this with authentication or a noindex tag, since robots.txt alone doesn’t guarantee a URL stays out of the index.
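
A small sketch covering these last two scenarios might look like this (the /admin/, /staging/, /resources/, and /resources/guides/ paths are hypothetical placeholders):

User-agent: *
Disallow: /admin/
Disallow: /staging/
Disallow: /resources/
Allow: /resources/guides/

Here the high-priority guides remain crawlable even though the rest of /resources/ is blocked.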

Track every aspect of your SEO

Get granular metrics into your keyword rankings, organic pages, and SERP features.

Go to Similarweb

How does robots.txt work?

Robots.txt files tell search engine bots which pages to ignore and which to prioritize. To understand this, let’s first explore what bots do.

How search engine bots discover and index content

The job of a search engine is to make web content available to end users through search. To do this, search engine bots or spiders have to discover content by systematically visiting and analyzing web pages. This process is called crawling.

To discover information, search engine bots start by visiting a list of known web pages. They then follow links from one page to another across the net.

Search engine bot

Once a page is crawled, the information is parsed, and relevant data is stored in the search engine’s index. The index is a massive database that allows the search engine to quickly retrieve and display relevant results when a user performs a search query.

How do robots.txt files impact crawling and indexing?

When a bot lands on a site, it checks for a robots.txt file to determine how it should crawl and index the site. If the file is present, it provides instructions for crawling. If there’s no robots.txt file or it lacks crawling instructions, the bot will proceed to crawl the site.

The robots.txt file starts by specifying the user agent. A user agent refers to software that accesses web content; in our case, a search engine bot. The file also includes directives such as:

  • Allow:
  • Disallow:

For example:

User-agent: *
Disallow: /private/
Allow: /public/
Disallow: /restricted/

In this example:

  • User-agent: * applies the rules to all web crawlers.
  • Disallow: /private/ instructs all web crawlers to avoid crawling the /private/ directory.
  • Allow: /public/ explicitly permits all web crawlers to crawl the /public/ directory (Allow is mainly useful for overriding a broader Disallow rule, as shown later in this guide).
  • Disallow: /restricted/ further disallows crawling of the /restricted/ directory.

It’s important to note that robots.txt rules are directives that search engine bots will generally follow. However, disallowing a page does not guarantee it stays out of the index: if other pages link to a disallowed URL, Google may index that URL without ever crawling it.

To keep a page out of the index, use noindex in the <head> section of the page’s HTML (and leave the page crawlable, otherwise the bot will never see the tag):

<meta name="robots" content="noindex">

Implementing crawl directives: Understanding robots.txt syntax

A robots.txt file informs a search engine how to crawl by use of directives. A directive is a command that provides a system (in this case, a search engine bot) information on how to behave.

Each directive begins by first specifying the user-agent and then setting the rules for that user-agent. The user agent is the application that acts on behalf of a user when interacting with a system or network; in our case, it is the search engine crawler rather than a web browser.

For example:

  • User-agent: Googlebot
  • User-agent: Bingbot
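
Rules are grouped under the user agent they apply to, so you can give one bot stricter rules than everyone else. A crawler follows the most specific group that matches its name and ignores the rest. In the hypothetical sketch below, Googlebot is kept out of /experiments/ while all other bots are kept out of /private/:

User-agent: Googlebot
Disallow: /experiments/

User-agent: *
Disallow: /private/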

Below, we have compiled two lists; one contains supported directives and the other unsupported directives.

Supported Directives

Disallow: This directive prevents search engines from crawling certain areas of a website. You can:

  1. Block access to all directories for all user agents.
    User-agent: * (The '*' is a wildcard. See below.)
    Disallow: /
  2. Block a particular directory for all user agents.
    User-agent: *
    Disallow: /portfolio
  3. Block access to PDFs (or any other file type) for all user agents by using the appropriate file extension.
    User-agent: *
    Disallow: /*.pdf

Allow: This directive allows search engines to crawl a page or directory. Use it to override a Disallow directive. Below, we block search engines from crawling the /portfolio folder but allow them access to the /allowed-portfolio subfolder inside it.

User-agent: *
Disallow: /portfolio
Allow: /portfolio/allowed-portfolio

Sitemap: You can specify the location of your sitemap in your robots.txt file. A sitemap is a file on your site that provides a structured list of URLs to help search engines discover and crawl your pages.

Sitemap: https://www.example.com/sitemap.xml
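
Putting these directives together, a complete robots.txt file might look like the sketch below, reusing the example paths from above (the sitemap URL is a placeholder):

User-agent: *
Disallow: /portfolio
Allow: /portfolio/allowed-portfolio
Disallow: /*.pdf

Sitemap: https://www.example.com/sitemap.xml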

If you want to understand more about directives, check out Google’s robots.txt guide.

Unsupported Directives

In 2019, Google announced that crawl-delay, nofollow, and noindex are not supported in robots.txt files. If you include them in your robots.txt file, they simply will not work. In reality, these rules were never officially supported by Google and were not intended to appear in robots.txt files; noindex and nofollow can instead be applied through robots meta tags on the individual pages of your site.

There are other options if you want to exclude pages from Google’s index, including:

  • Using the meta tag with noindex:

Add the following HTML meta tag to the <head> section of the page’s HTML:

<meta name="robots" content="noindex">

  • Using X-Robots-Tag HTTP header:

If you have access to server configuration, you can use the X-Robots-Tag HTTP header to achieve a similar result.

For example:

X-Robots-Tag: noindex
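
On an Apache server, for instance, a minimal sketch like the one below sends that header for every PDF file; it assumes the mod_headers module is enabled, and the file pattern is just an example:

<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>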

  • Using Google Search Console:

You can use the Removals tool in Google Search Console to request the temporary removal of a specific URL from Google’s search results.

Since crawl-delay is not supported by Google, if you want to ask Google to crawl slower, you can set the crawl rate in Google Search Console.

Using wildcards

Wildcards are characters you can use to provide directives that apply to multiple URLs at once. The two main wildcards used in robots.txt files are the asterisk (*) and the dollar sign ($).

You can apply them to URL paths or to user agents.

For example:

  1. Asterisk (*): When applied to user agents, the wildcard means "apply to all user agents." When applied to URLs, it matches any sequence of characters. If you have URLs that follow the same pattern, this will save you time.
  2. Dollar sign ($): The dollar sign is used at the end of a URL pattern to match URLs that end with a specific string.

In the example below, we are blocking search engines from crawling all PDF files:

User-agent: *
Disallow: /*.pdf$

URLs that end with .pdf will not be crawled. But take note that if a URL has additional text after the .pdf ending, that URL will still be crawlable.

How to create robots.txt files

If your website doesn’t have a robots.txt file, you can easily create one in a text editor. Simply open a blank .txt document and insert your directives. When you are finished, just save the file as ‘robots.txt,’ and there you have it.

Now, you might be wondering where to put your robots.txt file.

Your robots.txt file must live in the root directory of your site; crawlers only look for it there, so a file placed in a subdirectory will be ignored.

Once uploaded, make sure it is accessible via a web browser at https://www.yourdomain.com/robots.txt. If you want to check whether a given URL is blocked by your robots.txt file, you can test it with the Google Search Console URL Inspection tool.

Google Search Console URL inspection tool
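
As a quick sanity check, you can also fetch the file from the command line; this assumes curl is installed, and any HTTP client will do:

curl https://www.yourdomain.com/robots.txt

If the request returns your directives rather than a 404, crawlers will be able to find the file.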

How to add robots.txt to WordPress

WordPress in Similarweb

If you use WordPress, the easiest way to create a robots.txt file is with a plugin like Yoast SEO or All in One SEO.

If you use Yoast, go to SEO > Tools > File Editor. Click on the robots.txt tab, and you can create or edit your robots.txt file there.

If you use All in One SEO Pack, go to All in One SEO > Feature Manager. Activate the “Robots.txt” feature, and you can configure your directives from there.
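
For reference, a typical WordPress robots.txt looks something like the sketch below: WordPress’s virtual default blocks the admin area while keeping admin-ajax.php crawlable, and the Sitemap line is the kind of entry an SEO plugin usually adds (the URL is a placeholder):

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://www.example.com/sitemap_index.xml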

Common mistakes you want to avoid

Although there are many benefits to using robots.txt, getting it wrong can kill your traffic. Let’s get into some mistakes to avoid.

  • Blocking important content: By using overly restrictive rules, you might accidentally restrict important sections of your site
  • Blocking CSS, JavaScript, and Image files: Some search engines use these resources to understand the structure of your site
  • Incorrect case sensitivity: Robots.txt files are case sensitive
  • Assuming security through robots.txt: Sensitive content should be protected by other means as robots.txt is a guideline but does not ensure that pages will not be indexed
  • Incorrect syntax: Validate your files as typos can lead to search engines misinterpreting your robots.txt files

Robots.txt files: The final word

You now have a comprehensive understanding of robots.txt files. You know what they are, how they work, and how they can be used to enhance your SEO. Just remember to always review and test your robots.txt files. Done right, they will serve you well; done wrong, they might just mean the end of your organic traffic.

Use them wisely.

Download your copy of the indestructible SEO strategy guide

All the elements you need to build a successful SEO strategy

FAQs

What is the robots.txt file?

Robots.txt is a text file located in the root directory of a site and is used to inform web crawlers how to crawl and index the site.

How do I access a robots.txt file?

The easiest way to access a robots.txt file is to type the site’s URL into your browser and then add /robots.txt to the end. It should look like this: https://www.example.com/robots.txt.

Is robots.txt good for SEO?

The robots.txt file plays an important role in SEO. Although it doesn’t directly impact a website’s rankings, it helps search engines understand the site’s structure and which pages to include in or exclude from their index.

The robots.txt file can contribute to SEO by:

  • Controlling Crawling
  • Preserving Crawl Budgets
  • Managing Sitemaps
  • Preventing Indexation of Duplicate Content

It’s important to note that robots.txt files should be used carefully. Incorrectly configuring the file can inadvertently block search engines from accessing important content, leading to a negative impact on your site’s visibility.

When should you use a robots.txt file?

Use a robots.txt file to control search engine crawling. Restrict sensitive areas, prevent indexing of duplicate content, manage crawl budget, and guide bots away from non-essential or private content.

author-photo

by Darrell Mordecai

Darrell creates SEO content for Similarweb, drawing on his deep understanding of SEO and Google patents.

This post is subject to Similarweb legal notices and disclaimers.

The #1 keyword research tool

Give it a try or talk to our marketing team — don’t worry, it’s free!

