The complete, maintained list of AI crawlers in 2026. What GPTBot, ClaudeBot, PerplexityBot and 220+ other bots do on your site, and how to control each one.
Last updated: June 2026. We review this list every month and update it whenever a new crawler ships or an existing one changes behavior.
Every day, your website gets visited by bots you never invited. Some of them are building the next GPT. Some are fetching your pages so an AI assistant can quote you in a conversation happening right now. Some take your content and send nothing back.
Researchers have catalogued more than 220 distinct AI crawler user agents. The good news: a dozen of them account for the overwhelming majority of AI crawl traffic. This guide covers every crawler that matters, what each one actually does, and exactly how to allow or block it.
If you only remember one thing from this page, make it this: AI crawlers are not one category. They are three. Treating them as one is the single most common robots.txt mistake we see.
Almost every major AI company now runs a three-bot system. OpenAI started the pattern and Anthropic mirrors it; Perplexity runs a two-bot variant that covers search indexing and user fetches.
1. Training crawlers collect content to train future models. Blocking them keeps your future content out of training datasets, but it has no effect on data that was already collected. Examples: GPTBot, ClaudeBot, Bytespider.
2. Search index crawlers build the retrieval indexes behind AI search experiences. If you block these, your site stops appearing (and being cited) in AI search answers. Examples: OAI-SearchBot, Claude-SearchBot.
3. User-fetch agents retrieve a page on demand, because a human just asked the assistant something and the assistant needs your page to answer. These visits can turn into real referral traffic, since the assistant links to you as a source. Examples: ChatGPT-User, Claude-User, Perplexity-User.
The strategic consequence is simple: most sites benefit from allowing types 2 and 3 (visibility, citations, referrals) while making a deliberate choice about type 1 (training). Blocking everything because "AI is stealing content" also removes you from the channel that is currently growing faster than any other: total AI platform visits grew from 15.6 billion in Q1 2025 to 27.4 billion in Q1 2026, an increase of roughly 76% year over year.
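To make the three types concrete, here is a minimal Python sketch that maps a raw User-Agent header to one of them. The tokens come from the examples above; the `BotType` enum, the `classify_user_agent` helper, and the sample header are our own illustrative assumptions, not anything published by the vendors.

```python
from enum import Enum
from typing import Optional


class BotType(Enum):
    TRAINING = "training"
    SEARCH_INDEX = "search index"
    USER_FETCH = "user-fetch"


# Tokens taken from the examples in this guide; the mapping is illustrative.
BOT_TYPES = {
    "GPTBot": BotType.TRAINING,
    "ClaudeBot": BotType.TRAINING,
    "Bytespider": BotType.TRAINING,
    "OAI-SearchBot": BotType.SEARCH_INDEX,
    "Claude-SearchBot": BotType.SEARCH_INDEX,
    "ChatGPT-User": BotType.USER_FETCH,
    "Claude-User": BotType.USER_FETCH,
    "Perplexity-User": BotType.USER_FETCH,
}


def classify_user_agent(user_agent: str) -> Optional[BotType]:
    """Return the bot type for a raw User-Agent header, or None if unknown."""
    lowered = user_agent.lower()
    for token, bot_type in BOT_TYPES.items():
        # Case-insensitive substring match on the product token.
        if token.lower() in lowered:
            return bot_type
    return None


# Hypothetical header, shortened for readability.
print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.0)"))
```

A substring match is enough for a first pass because none of these tokens contains another. It only tells you what a visitor claims to be, so treat the result as a label for reporting, not as proof of identity.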
| User agent | Company | Type | What it does | robots.txt |
|---|---|---|---|---|
| GPTBot | OpenAI | Training | Collects content for model training | Respected |
| OAI-SearchBot | OpenAI | Search index | Indexes pages for ChatGPT search results | Respected |
| ChatGPT-User | OpenAI | User-fetch | Fetches pages live during ChatGPT conversations | Mostly (OpenAI says user-initiated fetches are "not fully governed" by robots.txt) |
| ClaudeBot | Anthropic | Training | Collects content for Claude model training | Respected |
| Claude-SearchBot | Anthropic | Search index | Indexes pages for Claude's search features | Respected |
| Claude-User | Anthropic | User-fetch | Fetches pages when Claude users request them | Respected |
| PerplexityBot | Perplexity | Search index | Indexes pages for cited answers | Respected |
| Perplexity-User | Perplexity | User-fetch | Fetches pages on user request | Disputed (documented compliance incidents in August 2025; Perplexity contests that user-initiated fetches must follow robots.txt) |
| Google-Extended | Google | Training control | Not a separate bot: a robots.txt token that tells Google not to use your content for Gemini training. Crawling still happens via Googlebot | Respected |
| GoogleOther | Google | Misc/R&D | Generic crawler for Google research and product development | Respected |
| Applebot-Extended | Apple | Training control | Like Google-Extended: a token controlling Apple Intelligence training use, crawling happens via Applebot | Respected |
| Bytespider | ByteDance | Training | Collects content for ByteDance models. Crawl volume more than doubled as of May 2026 | Inconsistent reports; treat as unreliable |
| CCBot | Common Crawl | Training (indirect) | Builds the open Common Crawl dataset used by many AI labs | Respected |
| Amazonbot | Amazon | Training + answers | Feeds Alexa and Amazon AI features | Respected |
| Meta-ExternalAgent | Meta | Training | Collects content for Meta AI models | Respected |
| Meta-ExternalFetcher | Meta | User-fetch | On-demand fetching for Meta AI experiences | Respected |
| FacebookBot | Meta | Training/legacy | Older Meta crawler, still active on many sites | Respected |
| DuckAssistBot | DuckDuckGo | Search index | Powers DuckAssist AI answers | Respected |
| cohere-ai | Cohere | Training | Model training collection | Respected |
| Diffbot | Diffbot | Aggregation | Structured data extraction, resold to AI buyers | Respected |
A note on the two "-Extended" entries: webmasters often look for a "Google-Extended bot" in their logs and never find it. That is expected. Google-Extended and Applebot-Extended are opt-out switches, not crawlers. The crawling is done by Googlebot and Applebot, which you almost certainly do not want to block.
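You can see the switch-versus-crawler distinction with Python's standard-library `urllib.robotparser`. The sketch below assumes a robots.txt that contains only a Google-Extended group; the `is_allowed` helper and the example.com URL are placeholders of ours. Python's parser only approximates how Google evaluates the file, so read this as an illustration of the rule, not a test of Google's behavior.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that opts out of Gemini training and says nothing else.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Evaluate a robots.txt string for one user agent and one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


URL = "https://example.com/article"  # placeholder URL
googlebot_allowed = is_allowed(ROBOTS_TXT, "Googlebot", URL)
extended_allowed = is_allowed(ROBOTS_TXT, "Google-Extended", URL)

print(googlebot_allowed, extended_allowed)  # prints: True False
```

The Google-Extended group never matches Googlebot, so search crawling is untouched while the training opt-out is recorded.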
OpenAI runs the cleanest three-bot separation, and with ChatGPT at roughly 700 million weekly active users and an estimated 87% of all AI platform traffic, its bots are the ones that matter most for your visibility.
Anthropic mirrors the same pattern: ClaudeBot for training, Claude-SearchBot for search indexing, Claude-User for live fetches during Claude conversations. You may also still see legacy tokens like Claude-Web or anthropic-ai in older logs and robots.txt templates; the three bots above are the current canonical set.
Perplexity is an answer engine first, so its crawling exists almost entirely to cite sources. PerplexityBot builds the index, Perplexity-User fetches on demand. Worth knowing: Perplexity has publicly argued that user-initiated fetches are not subject to robots.txt, and independent investigations documented compliance incidents in August 2025. If you need a hard guarantee against Perplexity fetches, a robots.txt line is not enough; you will need server-level blocking (user agent filtering or a bot management layer).
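As a rough idea of what user agent filtering looks like, here is a minimal WSGI middleware sketch in Python. It assumes your site runs behind a WSGI server; `BlockAgentsMiddleware`, `BLOCKED_TOKENS`, and `demo_app` are hypothetical names for illustration. In production you would more likely express the same rule in your web server or CDN configuration.

```python
# Hypothetical block list: lowercase substrings of the User-Agent header.
BLOCKED_TOKENS = ("perplexity-user", "perplexitybot")


class BlockAgentsMiddleware:
    """Return 403 to any request whose User-Agent contains a blocked token."""

    def __init__(self, app, blocked_tokens=BLOCKED_TOKENS):
        self.app = app
        self.blocked_tokens = tuple(token.lower() for token in blocked_tokens)

    def __call__(self, environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in self.blocked_tokens):
            start_response(
                "403 Forbidden",
                [("Content-Type", "text/plain; charset=utf-8")],
            )
            return [b"Forbidden\n"]
        # Everything else passes through to the wrapped application.
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Stand-in for your real application."""
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"Hello\n"]


application = BlockAgentsMiddleware(demo_app)
```

This only stops requests that announce themselves honestly in the User-Agent header, which is why a bot management layer is the stronger option when a guarantee matters.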
There is no universal answer, but there is a rational framework: compare what each bot takes (crawl load, content value) with what it gives back (citations, referral traffic).
That ratio varies enormously by bot type. Search and user-fetch bots send traffic back; training bots do not. This is exactly what the crawl-to-refer ratio measures, and the early data is striking: AI referral traffic to websites grew 357% year over year, and visitors who arrive from AI assistants convert at rates most channels never reach, because they arrive pre-qualified by a recommendation.
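A small Python sketch of that comparison, assuming the ratio is defined as crawl requests divided by referral visits over the same period. The `BotTraffic` record, the bot names, and every number below are invented placeholders; substitute counts from your own logs and analytics.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BotTraffic:
    name: str
    crawl_requests: int   # requests the bot made to your site
    referral_visits: int  # human visits the platform sent back


def crawl_to_refer_ratio(traffic: BotTraffic) -> Optional[float]:
    """Crawl requests per referral visit, or None if nothing came back."""
    if traffic.referral_visits == 0:
        return None
    return traffic.crawl_requests / traffic.referral_visits


# Invented example figures, not measurements of any real crawler.
search_bot = BotTraffic("example-search-bot", crawl_requests=1200, referral_visits=40)
training_bot = BotTraffic("example-training-bot", crawl_requests=5000, referral_visits=0)

print(crawl_to_refer_ratio(search_bot))    # prints: 30.0
print(crawl_to_refer_ratio(training_bot))  # prints: None
```

A lower ratio means a bot gives more back for what it takes. `None` marks the case where a bot crawls and no referrals arrive at all, which is what you would expect from a pure training crawler.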
Practical guidance by site type:
Template 1: maximize AI visibility, opt out of training
```
# Allow AI search and user-fetch bots (citations and referrals)
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /
```
Template 2: allow everything (maximum AI distribution)
No robots.txt entries needed. Every compliant bot crawls by default. Consider adding an llms.txt file instead, to guide what AI systems should read first.
Template 3: block everything AI
Take Template 1, set every entry to Disallow: /, and remember two caveats: already-trained models keep what they learned, and non-compliant bots ignore the file anyway. For real enforcement, filter user agents at the server or CDN level.
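If you would rather generate the file than edit it by hand, the transformation can be sketched in a few lines of Python. The two lists repeat the user agents from Template 1; `build_robots_txt` is a hypothetical helper of ours, and the output covers only the bots named in this guide.

```python
from typing import Iterable

# User agents exactly as they appear in Template 1.
ALLOWED_IN_TEMPLATE_1 = [
    "OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot",
    "Claude-User", "PerplexityBot",
]
BLOCKED_IN_TEMPLATE_1 = [
    "GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended",
    "Bytespider", "CCBot", "Meta-ExternalAgent",
]


def build_robots_txt(allow: Iterable[str], block: Iterable[str]) -> str:
    """Render one robots.txt group per user agent."""
    groups = [f"User-agent: {agent}\nAllow: /" for agent in allow]
    groups += [f"User-agent: {agent}\nDisallow: /" for agent in block]
    return "\n\n".join(groups) + "\n"


# Template 3: every entry from Template 1 becomes a Disallow.
template_3 = build_robots_txt(
    allow=[],
    block=ALLOWED_IN_TEMPLATE_1 + BLOCKED_IN_TEMPLATE_1,
)
print(template_3)
```

Calling `build_robots_txt(ALLOWED_IN_TEMPLATE_1, BLOCKED_IN_TEMPLATE_1)` reproduces Template 1, so one script can keep both variants in sync as the bot list changes.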
Everything above describes the general picture. Your site has its own reality, and it is probably different from what you expect.
Three ways to find out, from quick to thorough:
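If you have access to raw server logs, one way to check is to count hits per AI bot yourself. The Python sketch below assumes logs in the common "combined" format, where the User-Agent is the last double-quoted field on each line. `AI_BOT_TOKENS`, `count_ai_bot_hits`, and the sample lines are illustrative assumptions; adapt the pattern if your log format differs.

```python
import re
from collections import Counter
from typing import Iterable

# A subset of the user agents from the reference table.
AI_BOT_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User", "Bytespider", "CCBot",
]

# In combined log format the User-Agent is the last quoted field.
USER_AGENT_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_ai_bot_hits(log_lines: Iterable[str]) -> Counter:
    """Count requests per AI bot token found in the User-Agent field."""
    hits: Counter = Counter()
    for line in log_lines:
        match = USER_AGENT_FIELD.search(line)
        if not match:
            continue  # line is not in the expected format
        user_agent = match.group(1).lower()
        for token in AI_BOT_TOKENS:
            if token.lower() in user_agent:
                hits[token] += 1
                break
    return hits


# Invented sample lines using documentation-reserved IP addresses.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jun/2026:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [01/Jun/2026:10:00:03 +0000] "GET /blog HTTP/1.1" 200 8421 "-" "Mozilla/5.0 (compatible; ChatGPT-User/1.0)"',
    '198.51.100.7 - - [01/Jun/2026:10:00:04 +0000] "GET / HTTP/1.1" 200 3011 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/126.0"',
    '203.0.113.5 - - [01/Jun/2026:10:00:09 +0000] "GET /docs HTTP/1.1" 200 9120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
]

hits = count_ai_bot_hits(SAMPLE_LOG)
for token, count in hits.most_common():
    print(token, count)
```

Counts like these show which of the three bot types actually visits you, which is the input you need before deciding what to allow or block.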
What is GPTBot? GPTBot is OpenAI's training crawler. It collects web content used to train future GPT models. It respects robots.txt and is currently the most-blocked AI crawler on the web.
Does blocking GPTBot remove my site from ChatGPT? No. ChatGPT search visibility is controlled by OAI-SearchBot, and live page fetches are done by ChatGPT-User. Blocking GPTBot only opts your future content out of model training.
What is the difference between GPTBot and ChatGPT-User? GPTBot crawls proactively to gather training data. ChatGPT-User fetches a specific page on demand because a user's conversation needs it, which is the visit type that produces citations and referral clicks.
Do AI crawlers respect robots.txt? Most major ones do (OpenAI, Anthropic, Google, Apple, Amazon, Meta, Common Crawl). Documented exceptions and disputes exist, notably around Perplexity-User and Bytespider. For guarantees, block at the server level.
How many AI crawlers exist? Researchers have catalogued more than 220 distinct AI crawler user agents as of 2026. The twenty in our reference table account for the overwhelming majority of real-world AI crawl volume.
Should I use llms.txt or robots.txt for AI bots? Both, for different jobs. robots.txt controls access (allow/block). llms.txt is an emerging standard that tells AI systems what your site is about and which pages matter most, improving how you are represented in AI answers.