Reference

robots.txt for AI Crawlers — Complete Guide (2026)

Copy-paste templates and the complete directive reference for controlling AI crawlers via robots.txt. Covers GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, Claude-Web, Google-Extended, Applebot-Extended, PerplexityBot, Meta-ExternalAgent, Bingbot, and Bytespider.

What robots.txt is

robots.txt is a plain-text file at the root of your domain (yourdomain.com/robots.txt) that tells web crawlers which URLs they may or may not fetch. The Robots Exclusion Protocol has been a de facto web standard since 1994 and was formalized as RFC 9309 in 2022. Every major AI crawler operating in 2026, including those from OpenAI, Anthropic, Google, Microsoft, Apple, Meta, and Perplexity, publicly commits to respecting it.

For AI visibility, robots.txt is the single most important configuration on your site. It determines whether your content can be used as training data, cited in real-time AI answers, and indexed by AI-search products like ChatGPT Search.

Approximately 40% of websites accidentally block at least one major AI crawler due to overly strict default robots.txt files inherited from CMS templates or copied from older SEO guides. Always audit yours.

The basic directives

| Directive | Purpose | Example |
| --- | --- | --- |
| User-agent | Names the crawler the following rules apply to. The wildcard `*` matches all crawlers. | `User-agent: GPTBot` |
| Allow | Explicitly permits a path. Used to override a broader Disallow. | `Allow: /docs/` |
| Disallow | Blocks a path. An empty value blocks nothing (i.e., allows everything). | `Disallow: /admin/` |
| Sitemap | Points crawlers to your sitemap. Independent of allow/disallow rules. | `Sitemap: https://yourdomain.com/sitemap.xml` |

Rules are evaluated per-crawler. A User-agent declaration starts a block; everything until the next User-agent applies only to that bot.
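For example, in the two-group file below (the paths are illustrative), GPTBot follows only its own group and ignores the wildcard rules, while every crawler without a named group falls back to the User-agent: * block:

# GPTBot matches this group and ignores the wildcard group below
User-agent: GPTBot
Allow: /

# Any crawler not named above falls back to these rules
User-agent: *
Disallow: /private/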

Template: allow all AI crawlers

Recommended for any site that wants AI visibility. Drop this into your robots.txt to explicitly permit every major AI crawler. Explicit allow rules are clearer documentation than implicit defaults.

# Allow all major AI crawlers — recommended for AEO

User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

User-agent: Meta-ExternalFetcher
Allow: /

User-agent: bingbot
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Template: block all AI crawlers

Use only if you have a clear reason — typically copyright-sensitive publishers or businesses with confidential first-party data. Blocking all AI crawlers means your category gets answered by everyone except you.

# Block all major AI crawlers — visibility cost is high

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Bytespider
Disallow: /

Template: block training, allow real-time

The middle path — your content doesn't train future LLMs, but AI assistants can still cite you live when users ask questions. Used by news publishers, premium content sites, and brands worried about copyright but unwilling to lose AI search visibility.

# Block training crawlers — keep real-time + search visibility

# ── BLOCK: training crawlers ──
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Bytespider
Disallow: /

# ── ALLOW: real-time + search crawlers ──
User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Meta-ExternalFetcher
Allow: /

User-agent: bingbot
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Template: allow only docs and blog

For SaaS products that want AI to learn from public docs and blog posts but not from app pages, account dashboards, or internal tooling.

# Allow AI on docs + blog only

User-agent: GPTBot
Disallow: /
Allow: /docs/
Allow: /blog/

User-agent: ChatGPT-User
Disallow: /
Allow: /docs/
Allow: /blog/

User-agent: ClaudeBot
Disallow: /
Allow: /docs/
Allow: /blog/

User-agent: PerplexityBot
Disallow: /
Allow: /docs/
Allow: /blog/

# Fallback for every other crawler: block app routes
# (Bots matched by a named group above ignore these wildcard rules;
# their own Disallow: / already keeps them out of app routes.)
User-agent: *
Disallow: /dashboard/
Disallow: /app/
Disallow: /api/

Sitemap: https://yourdomain.com/sitemap.xml

Common mistakes

  • Blocking all bots with `User-agent: *` followed by `Disallow: /`

    This blocks every crawler — Google, Bing, AI bots, social previews. If you wanted to block only AI, you locked yourself out of search and link previews too.

  • Using `Disallow: /*`

    In parsers without wildcard support, the `*` is treated as a literal character, so the rule may match nothing at all. Use `Disallow: /` for an absolute block.

  • Mixing crawler-specific and wildcard rules

    If you have `User-agent: *` followed by allow/disallow rules, then add a `User-agent: GPTBot` block, GPTBot ignores the wildcard rules entirely. Each User-agent block is fully self-contained.

  • Forgetting the sitemap directive

    AI crawlers use sitemap.xml to discover content beyond your homepage. Always include `Sitemap: https://yourdomain.com/sitemap.xml`.

  • Editing robots.txt only on staging

    Verify the live URL — `https://yourdomain.com/robots.txt` — returns the file you intended. CDN caching, framework-level overrides, or path mismatches are common.

  • Expecting retroactive removal

    robots.txt only affects future crawls. Content already in training data isn't removed when you add a Disallow. To opt out of existing data, contact the provider directly.

How to verify your robots.txt is working

  1. Visit it directly: open https://yourdomain.com/robots.txt in a browser. The file should load as plain text.
  2. Check the response code: it must return 200 OK. A 404 means no robots.txt — every crawler will fetch everything by default.
  3. Run an analyzer: paste your URL into a free analyzer that tests each AI crawler against your rules. This catches subtle issues (wildcard conflicts, mixed allow/disallow, encoding problems) automatically.
  4. Check server logs: after 1–3 days, look for user-agent strings like GPTBot or ClaudeBot in your access logs. Requests from these bots confirm they are fetching your content; requests to paths you disallowed point to a compliance problem.
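If you prefer a scripted check, here is a minimal sketch using Python's standard-library urllib.robotparser. The domain and the bot list are placeholders to adapt, and the standard-library parser resolves Allow/Disallow conflicts more simply than the longest-match rule in RFC 9309, so treat it as a quick sanity check rather than a definitive audit:

import urllib.robotparser

# Placeholder: replace with your own domain
ROBOTS_URL = "https://yourdomain.com/robots.txt"

# AI crawler user-agent tokens covered in this guide
AI_BOTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "ClaudeBot", "Claude-Web", "Google-Extended",
    "Applebot-Extended", "PerplexityBot",
    "Meta-ExternalAgent", "Bytespider",
]

parser = urllib.robotparser.RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.read()  # fetch and parse the live robots.txt

for bot in AI_BOTS:
    # can_fetch() checks the given user-agent token against the parsed groups
    allowed = parser.can_fetch(bot, "https://yourdomain.com/")
    print(f"{bot}: {'allowed' if allowed else 'blocked'} at /")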

Free robots.txt analyzer

Tests your robots.txt against every major AI crawler in 30 seconds. Per-bot allow/block status with copy-paste fixes.

Analyze yours

Frequently asked questions

Where does robots.txt go?

robots.txt is a plain-text file placed at the root of your domain — yourdomain.com/robots.txt. It must be served with HTTP 200 and Content-Type: text/plain. Crawlers fetch it before crawling any other URL on the domain.

How do I allow GPTBot in robots.txt?

Add: User-agent: GPTBot followed by Allow: / on the next line. Or omit any rule for GPTBot — by default, crawlers are allowed if there's no Disallow directive matching them. Explicit allow is clearer documentation.

How do I block ChatGPT entirely?

ChatGPT uses three crawlers: GPTBot (training), ChatGPT-User (real-time browsing), and OAI-SearchBot (the ChatGPT Search index). To block ChatGPT entirely, disallow all three, as shown below. Blocking just GPTBot still leaves you visible in ChatGPT Search and in live browsing queries.
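In robots.txt form:

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /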

Will my changes to robots.txt take effect immediately?

AI crawlers re-fetch robots.txt before each crawl session, typically within 24-72 hours of a change. Existing training data isn't removed retroactively — robots.txt only controls future crawls. To remove already-indexed content from a model, you generally need to contact the provider directly.

Can I use Allow and Disallow together?

Yes. When Allow and Disallow rules conflict, the longest (most specific) matching path wins. For example, you can disallow your site root but allow a specific section, as shown below; this blocks GPTBot from everything except /docs/ and is useful for opening up only documentation while keeping the rest private.
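The corresponding rules:

User-agent: GPTBot
Disallow: /
Allow: /docs/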

Do AI crawlers obey robots.txt?

All major AI crawlers — OpenAI's GPTBot/ChatGPT-User/OAI-SearchBot, Anthropic's ClaudeBot/Claude-Web, Google's Google-Extended, Apple's Applebot-Extended, Meta's Meta-ExternalAgent, Perplexity's PerplexityBot, and Microsoft's Bingbot — publicly commit to respecting robots.txt and have been audited by independent researchers. Bytespider (ByteDance) has historically had compliance issues but has been improving in 2025-2026.

What's the difference between blocking GPTBot and Google-Extended?

Blocking GPTBot stops OpenAI from using your content for training. Blocking Google-Extended stops Google from using your content for Gemini training — but Google-Extended is a robots.txt token, not a separate crawler, so blocking it doesn't affect standard Google Search indexing. Both are surgical opt-outs from training data while preserving search visibility.
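A minimal opt-out, taken from the training-block template above, looks like this; regular Googlebot crawling and Search indexing are unaffected:

User-agent: Google-Extended
Disallow: /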


Updated for 2026 with current user-agent strings for OpenAI, Anthropic, Google, Microsoft, Apple, Meta, Perplexity, and ByteDance crawlers.