Reference
Complete List of AI Crawlers (2026)
Every AI crawler currently fetching public web content — user-agent strings, owner organizations, purpose (training vs real-time vs search), and how to allow or block each one. Updated for 2026.
What AI crawlers are
AI crawlers are automated programs operated by AI companies (OpenAI, Anthropic, Google, Microsoft, Apple, Meta, Perplexity, ByteDance) that fetch public web content. Some add what they read to LLM training data. Others fetch content in real time when a user asks the AI something that needs browsing. A third group powers AI-search products that have their own indexes (ChatGPT Search, Bing AI).
Every AI crawler identifies itself with a user-agent string and, with rare exceptions, respects robots.txt. You can allow or block each one independently.
Training vs real-time crawlers
| Crawler type | What it does | Block effect |
|---|---|---|
| Training | Fetches content to add to LLM training corpus. Examples: GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, Meta-ExternalAgent, Bytespider. | Your content won't be in the model's baseline knowledge. Long-term effect. |
| Real-time | Fetches content during a live user query. Examples: ChatGPT-User, Claude-Web, PerplexityBot, Meta-ExternalFetcher. | AI can't cite your live content in answers. Immediate visibility loss. |
| Search | Builds a dedicated AI-search index. Examples: OAI-SearchBot (ChatGPT Search), Bingbot (powers ChatGPT Search + Copilot). | You disappear from that AI-search product entirely. |
For maximum AI visibility, allow all three categories. Many publishers selectively block training crawlers (concerns about copyright) while allowing real-time and search crawlers — that pattern preserves AI citation visibility without contributing to training data.
Complete crawler reference
Every major AI crawler in 2026, grouped by owner organization. User-agent strings are the canonical identifiers — copy them exactly into your robots.txt or server logs to filter.
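If you want to see which of these bots are already hitting your site, a quick way is to scan your access logs for the bot tokens listed below. A minimal sketch in Python, assuming combined-log-format lines (the sample log lines here are made up for illustration):

```python
import re

# Hypothetical access-log lines; in the combined log format the
# user agent is the final quoted field.
log_lines = [
    '1.2.3.4 - - [10/Jan/2026:12:00:00 +0000] "GET /pricing HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.2; +https://openai.com/gptbot"',
    '5.6.7.8 - - [10/Jan/2026:12:00:01 +0000] "GET / HTTP/1.1" 200 1024 "-" '
    '"Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"',
]

# Match on the bot token, not the full string, so minor version
# bumps (GPTBot/1.2 -> GPTBot/1.3) don't break the filter.
AI_BOTS = re.compile(
    r"GPTBot|ChatGPT-User|OAI-SearchBot|ClaudeBot|Claude-Web|"
    r"PerplexityBot|Perplexity-User|Meta-ExternalAgent|"
    r"Meta-ExternalFetcher|Bytespider|bingbot",
    re.IGNORECASE,
)

for line in log_lines:
    m = AI_BOTS.search(line)
    if m:
        print(m.group(0))  # prints: GPTBot
```

Only the first sample line matches; the second is an ordinary browser visit.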
OpenAI
Operates ChatGPT and the OpenAI API. Three distinct crawlers, each with a separate purpose.
GPTBot (Training): Adds content to OpenAI's training corpus. Block this if you don't want your content training future GPT models.
User-agent string
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; GPTBot/1.2; +https://openai.com/gptbot
ChatGPT-User (Real-time): Fires when a ChatGPT user asks a question that requires fetching live web content. Blocking this means ChatGPT can't cite your live page in answers.
User-agent string
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot
OAI-SearchBot (Search): Crawls for ChatGPT Search's index. Blocking this removes you from ChatGPT Search results.
User-agent string
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot
Anthropic
Operates Claude. Two crawlers: training and real-time.
ClaudeBot (Training): Crawls public content for Claude's training corpus.
User-agent string
Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)
Claude-Web (Real-time): Fetches in real time when a Claude user asks something requiring browsing.
User-agent string
Mozilla/5.0 (compatible; Claude-Web/1.0; +https://www.anthropic.com/claude/web)
Google
Operates Gemini, AI Overviews, and AI Mode. Uses a special user-agent token rather than a separate crawler.
Google-Extended (Training): Not a separate crawler. Google-Extended is a robots.txt token that opts you out of training data for Gemini, Vertex AI, and other Google AI products without affecting standard search indexing.
User-agent string
(token applied to Googlebot)
GoogleOther (Real-time): Generic user-agent for non-search Google products, used by various internal Google tools including AI features.
User-agent string
GoogleOther
Microsoft
Operates Copilot and Bing AI. ChatGPT Search ALSO uses Bing's index, so allowing Bingbot is critical for ChatGPT visibility.
Bingbot (Search): Powers Bing Search, Microsoft Copilot, AND ChatGPT Search. Block this and you lose visibility on three major AI surfaces at once.
User-agent string
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm
Perplexity AI
Operates the Perplexity answer engine. Crawls in real time when users ask questions.
PerplexityBot (Real-time): Fetches in real time when a Perplexity user asks a question. Blocking PerplexityBot makes you invisible on Perplexity entirely.
User-agent string
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot
Perplexity-User (Real-time): Handles direct user-initiated fetches (when a user clicks a citation in a Perplexity answer).
User-agent string
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexity-user
Apple
Operates Apple Intelligence and Siri's AI features.
Applebot-Extended (Training): Apple's opt-out token for Apple Intelligence training data. Same pattern as Google-Extended: disallows training without removing you from Siri search results.
User-agent string
(token applied to Applebot)
Applebot (Search): Standard Apple search crawler used by Siri Suggestions and Spotlight Web Search.
User-agent string
Mozilla/5.0 (Device; OS_version) AppleWebKit/WebKit_version (KHTML, like Gecko) Version/Safari_version Safari/WebKit_version (Applebot/Applebot_version)
Meta
Operates Meta AI (Llama-based assistant in Messenger, Instagram, WhatsApp).
Meta-ExternalAgent (Training): Crawls for Meta AI training data and embeddings.
User-agent string
Meta-ExternalAgent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)
Meta-ExternalFetcher (Real-time): Fetches in real time for Meta AI live queries.
User-agent string
Meta-ExternalFetcher/1.0
ByteDance
Operates Doubao (China's largest AI assistant) and TikTok's AI features.
Bytespider (Training): ByteDance's crawler, which has historically had a mixed track record with robots.txt compliance. If you serve a global audience and want Doubao visibility, allow it; otherwise, many publishers block it.
User-agent string
Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)
robots.txt template — allow all AI crawlers
Drop this into /robots.txt at your domain root to explicitly allow every major AI crawler. Recommended for any site that wants to appear in AI answers.
# Allow all AI crawlers — recommended for AEO
# Place at https://yourdomain.com/robots.txt

User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

User-agent: Meta-ExternalFetcher
Allow: /

# Bingbot powers ChatGPT Search and Microsoft Copilot
User-agent: bingbot
Allow: /

# Standard sitemap directive
Sitemap: https://yourdomain.com/sitemap.xml
Free robots.txt analyzer
Check whether your robots.txt is blocking any AI crawler — paste your URL and get a per-bot status.
Should I allow AI crawlers?
Three frameworks publishers use, in increasing order of restrictiveness:
1. Allow everything
Maximum AI visibility. Your content trains LLMs and gets cited live. Most SaaS, content businesses, and tool sites pick this — the visibility upside far exceeds the training-data risk.
2. Allow real-time + search, block training
You appear in live AI answers but your content isn't used to train baseline models. Common for publishers and brands worried about copyright. Block: GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, Meta-ExternalAgent, Bytespider. Allow: ChatGPT-User, OAI-SearchBot, Claude-Web, PerplexityBot, Meta-ExternalFetcher, Bingbot.
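As a sketch, the block/allow lists in option 2 translate into a robots.txt like the following (all bot names are taken from the reference above; adapt paths to your own site):

```txt
# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Bytespider
Disallow: /

# Allow real-time and search crawlers
User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Meta-ExternalFetcher
Allow: /

User-agent: bingbot
Allow: /
```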
3. Block everything
Maximum control, zero AI visibility. Used by some news publishers and high-IP businesses. The trade-off is severe — your category gets answered by everyone except you.
For most AEO use cases — SaaS, agencies, content businesses, tools — option 1 (allow everything) is the right default. The fastest way to get cited by AI is to be readable by AI.
Frequently asked questions
What user-agent does ChatGPT use?
OpenAI operates three crawlers: GPTBot (training data crawler), ChatGPT-User (real-time queries when a user asks ChatGPT something that requires browsing), and OAI-SearchBot (the ChatGPT search index crawler). All three identify with distinct user-agent strings and can be controlled independently in robots.txt.
What is the user-agent for Claude?
Anthropic uses two crawlers: ClaudeBot (training data) and Claude-Web (real-time when a user asks Claude something requiring web access). Both identify themselves clearly and respect robots.txt directives.
What is Google-Extended?
Google-Extended is Google's user-agent token (not a separate crawler) that lets sites opt out of having their content used to train Gemini and other Google AI products without affecting standard Google Search indexing. Disallowing Google-Extended in robots.txt removes the site from Gemini's training data while keeping it indexed for regular search.
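You can sanity-check this opt-out pattern locally with Python's standard-library robots.txt parser. A minimal sketch, using a made-up robots.txt body and example.com as a placeholder:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of Google AI training via the
# Google-Extended token while leaving everything else allowed.
robots_txt = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The token is disallowed, but ordinary Googlebot still gets through.
print(rp.can_fetch("Google-Extended", "https://example.com/page"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/page"))        # True
```

Because Google-Extended is a token rather than a crawler, nothing else about Googlebot's crawling behavior changes.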
Should I allow all AI crawlers?
If you want your content cited by AI assistants, yes. Blocking AI crawlers means your product doesn't show up when ChatGPT, Claude, Perplexity, or Gemini answer questions about your category. If your concern is training data, note that real-time fetch crawlers (ChatGPT-User, Claude-Web, PerplexityBot) don't add anything to training data; blocking them only prevents AI from quoting your live content, even when a user explicitly asks for it.
How do I block AI crawlers?
Add User-agent: <bot-name> followed by Disallow: / for each crawler in your robots.txt. For example, to block GPTBot: User-agent: GPTBot then Disallow: /. To allow it explicitly: Disallow: (empty value) or use Allow: /. See the full list of user-agents in this doc.
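The empty-Disallow detail is easy to verify with Python's standard-library parser (a sketch; example.com is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# An empty Disallow value means "disallow nothing", i.e. allow everything.
robots_txt = """\
User-agent: GPTBot
Disallow:
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())
print(rp.can_fetch("GPTBot", "https://example.com/any/page"))  # True
```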
Do AI crawlers respect robots.txt?
All major AI crawlers from OpenAI, Anthropic, Google, Microsoft, and Apple publicly commit to respecting robots.txt directives. Bytespider (ByteDance) and some smaller crawlers have a mixed track record. Independent researchers have observed PerplexityBot, ClaudeBot, and GPTBot respecting robots.txt in practice.
What's the difference between training crawlers and real-time crawlers?
Training crawlers (GPTBot, ClaudeBot, Google-Extended, Applebot-Extended) fetch content to add to LLM training data — what the model 'knows' baseline. Real-time crawlers (ChatGPT-User, Claude-Web, PerplexityBot) fetch content in response to a live user query that requires browsing — they don't store the data, just use it to answer that specific question. Most sites should allow both. Blocking real-time crawlers makes you invisible during live AI search even if you're in the training data.