AI Crawlers List 2026: Every Bot Explained (GPTBot, ClaudeBot & More)

Last updated: June 2026. We review this list every month and update it whenever a new crawler ships or an existing one changes behavior.

Every day, your website gets visited by bots you never invited. Some of them are building the next GPT. Some are fetching your pages so an AI assistant can quote you in a conversation happening right now. Some take your content and send nothing back.

Researchers have catalogued more than 220 distinct AI crawler user agents. The good news: a dozen of them account for the overwhelming majority of AI crawl traffic. This guide covers every crawler that matters, what each one actually does, and exactly how to allow or block it.

If you only remember one thing from this page, make it this: AI crawlers are not one category. They are three. Treating them as one is the single most common robots.txt mistake we see.

The three types of AI crawlers

Several major AI companies now split their crawling across purpose-specific bots. OpenAI started the pattern with three bots, Anthropic mirrors it, and Perplexity runs a two-bot version (index and user-fetch).

1. Training crawlers collect content to train future models. Blocking them keeps your future content out of training datasets, but it has no effect on data that was already collected. Examples: GPTBot, ClaudeBot, Bytespider.

2. Search index crawlers build the retrieval indexes behind AI search experiences. If you block these, your site stops appearing (and being cited) in AI search answers. Examples: OAI-SearchBot, Claude-SearchBot.

3. User-fetch agents retrieve a page on demand, because a human just asked the assistant something and the assistant needs your page to answer. These visits can turn into real referral traffic, since the assistant links to you as a source. Examples: ChatGPT-User, Claude-User, Perplexity-User.

The strategic consequence is simple: most sites benefit from allowing types 2 and 3 (visibility, citations, referrals) while making a deliberate choice about type 1 (training). Blocking everything because "AI is stealing content" also removes you from the channel that is currently growing faster than any other: total AI platform visits grew 42.8% year over year, from 15.6 billion in Q1 2025 to 27.4 billion in Q1 2026.
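
To make the three categories concrete, here is a minimal Python sketch that sorts a raw User-Agent header into one of them. The bot tokens come from the examples above; the function name, the "unknown" fallback, and the simplified header strings in the usage lines are our own assumptions for illustration, not anything the vendors publish.

```python
# Tokens and categories follow the three types described in this article.
# classify_user_agent and the "unknown" fallback are illustrative assumptions.
BOT_TYPES = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "Bytespider": "training",
    "OAI-SearchBot": "search",
    "Claude-SearchBot": "search",
    "ChatGPT-User": "user-fetch",
    "Claude-User": "user-fetch",
    "Perplexity-User": "user-fetch",
}


def classify_user_agent(user_agent: str) -> str:
    """Return "training", "search", "user-fetch", or "unknown" for a header value."""
    lowered = user_agent.lower()
    for token, bot_type in BOT_TYPES.items():
        # Case-insensitive substring match on the bot token.
        if token.lower() in lowered:
            return bot_type
    return "unknown"


# Simplified, made-up header values; real headers are longer.
print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.0)"))
print(classify_user_agent("Mozilla/5.0 (compatible; ChatGPT-User/1.0)"))
```

Matching on the token rather than the full header keeps the sketch robust to version bumps, at the cost of trusting whatever the client claims to be.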

The reference table

User agent	Company	Type	What it does	robots.txt
`GPTBot`	OpenAI	Training	Collects content for model training	Respected
`OAI-SearchBot`	OpenAI	Search index	Indexes pages for ChatGPT search results	Respected
`ChatGPT-User`	OpenAI	User-fetch	Fetches pages live during ChatGPT conversations	Mostly (OpenAI says user-initiated fetches are "not fully governed" by robots.txt)
`ClaudeBot`	Anthropic	Training	Collects content for Claude model training	Respected
`Claude-SearchBot`	Anthropic	Search index	Indexes pages for Claude's search features	Respected
`Claude-User`	Anthropic	User-fetch	Fetches pages when Claude users request them	Respected
`PerplexityBot`	Perplexity	Search index	Indexes pages for cited answers	Respected
`Perplexity-User`	Perplexity	User-fetch	Fetches pages on user request	Disputed (documented compliance incidents in August 2025; Perplexity contests that user-initiated fetches must follow robots.txt)
`Google-Extended`	Google	Training control	Not a separate bot: a robots.txt token that tells Google not to use your content for Gemini training. Crawling still happens via Googlebot	Respected
`GoogleOther`	Google	Misc/R&D	Generic crawler for Google research and product development	Respected
`Applebot-Extended`	Apple	Training control	Like Google-Extended: a token controlling Apple Intelligence training use; crawling happens via Applebot	Respected
`Bytespider`	ByteDance	Training	Collects content for ByteDance models. Crawl volume more than doubled as of May 2026	Inconsistent reports; treat as unreliable
`CCBot`	Common Crawl	Training (indirect)	Builds the open Common Crawl dataset used by many AI labs	Respected
`Amazonbot`	Amazon	Training + answers	Feeds Alexa and Amazon AI features	Respected
`Meta-ExternalAgent`	Meta	Training	Collects content for Meta AI models	Respected
`Meta-ExternalFetcher`	Meta	User-fetch	On-demand fetching for Meta AI experiences	Respected
`FacebookBot`	Meta	Training/legacy	Older Meta crawler, still active on many sites	Respected
`DuckAssistBot`	DuckDuckGo	Search index	Powers DuckAssist AI answers	Respected
`cohere-ai`	Cohere	Training	Model training collection	Respected
`Diffbot`	Diffbot	Aggregation	Structured data extraction, resold to AI buyers	Respected

A note on the two "-Extended" entries: webmasters often look for a "Google-Extended bot" in their logs and never find it. That is expected. Google-Extended and Applebot-Extended are opt-out switches, not crawlers. The crawling is done by Googlebot and Applebot, which you almost certainly do not want to block.
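
You can check the "switch, not crawler" behavior locally with Python's standard-library robots.txt parser. This is only a sketch: `urllib.robotparser` approximates how a compliant bot reads the file and is not Google's or Apple's actual implementation, and the example.com URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Opt out of training use only; Googlebot and Applebot are not mentioned.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/pricing"  # placeholder URL
for agent in ("Google-Extended", "Googlebot", "Applebot-Extended", "Applebot"):
    # The two -Extended tokens come back False; the real crawlers stay allowed.
    print(agent, parser.can_fetch(agent, url))
```

The two "-Extended" tokens are disallowed while Googlebot and Applebot remain free to crawl, which is why the opt-out does not cost you search visibility.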

OpenAI's crawlers, explained

OpenAI runs the cleanest three-bot separation, and with ChatGPT at roughly 700 million weekly active users and an estimated 87% of all AI platform traffic, its bots are the ones that matter most for your visibility.

GPTBot is the training crawler. Blocking it is a values decision more than a traffic decision: it does not remove you from ChatGPT search and does not stop citations. It only keeps future training runs from ingesting your content. GPTBot is currently the most blocked AI crawler on the web (roughly 11% of analyzed sites block it, more than any other AI bot).
OAI-SearchBot is what gets you into ChatGPT's search index. If you care about being cited and clicked from ChatGPT answers, this bot must be allowed. Blocking it while allowing GPTBot is the exact opposite of what most site owners actually want.
ChatGPT-User shows up when a real person's conversation triggers a fetch of your page. Think of it as a click that has not happened yet: the assistant reads your page, summarizes it, and links you as a source.

Anthropic's crawlers, explained

Anthropic mirrors the same pattern: ClaudeBot for training, Claude-SearchBot for search indexing, Claude-User for live fetches during Claude conversations. You may also still see legacy tokens like Claude-Web or anthropic-ai in older logs and robots.txt templates; the three bots above are the current canonical set.

Perplexity's crawlers, explained

Perplexity is an answer engine first, so its crawling exists almost entirely to cite sources. PerplexityBot builds the index, Perplexity-User fetches on demand. Worth knowing: Perplexity has publicly argued that user-initiated fetches are not subject to robots.txt, and independent investigations documented compliance incidents in August 2025. If you need a hard guarantee against Perplexity fetches, a robots.txt line is not enough; you will need server-level blocking (user agent filtering or a bot management layer).

Should you block AI crawlers?

There is no universal answer, but there is a rational framework: compare what each bot takes (crawl load, content value) with what it gives back (citations, referral traffic).

That ratio varies enormously by bot type. Search and user-fetch bots send traffic back; training bots do not. This is exactly what the crawl-to-refer ratio measures, and the early data is striking: AI referral traffic to websites grew 357% year over year, and visitors who arrive from AI assistants convert at rates most channels never reach, because they arrive pre-qualified by a recommendation.
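
As a rough sketch, the ratio can be computed per bot family from two counts. We assume here that it means crawler requests divided by referral visits over the same period; the function name and the weekly numbers are invented for illustration and are not measurements.

```python
def crawl_to_refer_ratio(crawl_requests: int, referral_visits: int) -> float:
    """Crawler requests per referral visit; infinity if nothing is sent back."""
    if crawl_requests < 0 or referral_visits < 0:
        raise ValueError("counts must be non-negative")
    if referral_visits == 0:
        # Content taken, no visitors returned (or no activity at all).
        return float("inf") if crawl_requests else 0.0
    return crawl_requests / referral_visits


# Hypothetical weekly counts: (crawler requests, referral visits).
weekly = {
    "search + user-fetch bots": (1_200, 60),
    "training bots": (5_000, 0),
}
for label, (crawls, referrals) in weekly.items():
    print(label, crawl_to_refer_ratio(crawls, referrals))
```

A lower number means a bot family gives more back for what it takes; a family that sends no referrals at all shows up as infinity.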

Practical guidance by site type:

Publishers and content businesses: consider blocking training bots (your content is your asset) while allowing search and user-fetch bots (citations are distribution).
SaaS, e-commerce and service businesses: allow everything except the chronically non-compliant. Being recommended by ChatGPT when someone asks "what tool should I use for X" is the new word of mouth, and you want your docs and comparison pages in every index.
Sites with paywalled or licensed content: block training and consider licensing deals; allow search bots only for free content sections.

Copy-paste robots.txt templates

Template 1: maximize AI visibility, opt out of training

# Allow AI search and user-fetch bots (citations and referrals)
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /
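
Before deploying a template like this, it can help to sanity-check it. The sketch below feeds an abbreviated copy of Template 1 to Python's standard-library parser and reports which bots may fetch the homepage. The helper name `allowed_bots` and the example.com URL are assumptions, and the parser only approximates how each vendor evaluates the file.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of Template 1: same rules, fewer bots, to keep this short.
TEMPLATE_1 = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""


def allowed_bots(robots_txt, agents, url="https://example.com/"):
    """Map each user agent token to True (may fetch url) or False."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_bots(
    TEMPLATE_1,
    ["OAI-SearchBot", "ChatGPT-User", "GPTBot", "ClaudeBot", "PerplexityBot"],
)
for agent, allowed in result.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

PerplexityBot is not listed in the abbreviated file and still comes back allowed, because a bot without a matching group is permitted by default.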

Template 2: allow everything (maximum AI distribution)

No robots.txt entries needed. Every compliant bot crawls by default. Consider adding an llms.txt file instead, to guide what AI systems should read first.

Template 3: block everything AI

Take Template 1, set every entry to Disallow: /, and remember two caveats: already-trained models keep what they learned, and non-compliant bots ignore the file anyway. For real enforcement, filter user agents at the server or CDN level.
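
What server-level filtering looks like depends on your stack. As one hedged illustration, here is a small WSGI middleware in Python that answers 403 when the User-Agent header contains a blocked token. The class name, the token list, and the stand-in `site` app are assumptions for the example; a CDN or web server rule would do the same job earlier in the request path.

```python
# Example policy only: adjust the tokens to your own decision.
BLOCKED_TOKENS = ("gptbot", "claudebot", "bytespider", "ccbot")


class BlockAIBots:
    """WSGI middleware: reply 403 when the User-Agent contains a blocked token."""

    def __init__(self, app, blocked_tokens=BLOCKED_TOKENS):
        self.app = app
        self.blocked_tokens = tuple(token.lower() for token in blocked_tokens)

    def __call__(self, environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in self.blocked_tokens):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Everyone else reaches the wrapped application unchanged.
        return self.app(environ, start_response)


def site(environ, start_response):
    # Stand-in for the real application.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


app = BlockAIBots(site)
```

This only stops bots that identify themselves honestly. A client that sends a different User-Agent passes straight through, which is where a bot management layer comes in.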

How to see which AI bots crawl YOUR site

Everything above describes the general picture. Your site has its own reality, and it is probably different from what you expect.

Three ways to find out, from quick to thorough:

1. Check your robots.txt against the current bot list (the table above changes; your file probably has not).

2. Grep your server logs for the user agent tokens in the reference table and count requests per bot per week.

3. Use an analytics tool that separates AI crawl traffic from AI referral traffic and computes the ratio between them, so you can see which AI engines take content and which ones send customers. That is exactly what we built Datalenk for, and our free AI Traffic Checker does the robots.txt and visibility part in ten seconds, no signup required.
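
For the log-based option, a short Python script is often enough. The sketch below assumes your server writes the common or combined access log format with a bracketed timestamp; the sample lines, IP addresses, and simplified User-Agent strings are invented for illustration, and the token list is a subset of the reference table.

```python
import re
from collections import Counter
from datetime import datetime

BOT_TOKENS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
              "Claude-SearchBot", "Claude-User", "PerplexityBot",
              "Perplexity-User", "Bytespider", "CCBot"]

# Timestamp as written in common/combined logs: [02/Jun/2026:10:15:00 +0000]
TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def count_bot_hits(lines):
    """Count requests per (ISO year-week, bot token) from access-log lines."""
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.search(line)
        if not match:
            continue  # skip lines without a recognizable timestamp
        when = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        year, week, _ = when.isocalendar()
        lowered = line.lower()
        for token in BOT_TOKENS:
            # Naive substring match anywhere in the line, first token wins.
            if token.lower() in lowered:
                counts[(f"{year}-W{week:02d}", token)] += 1
                break
    return counts


# Invented sample lines; in practice, pass an open log file instead.
sample = [
    '203.0.113.7 - - [02/Jun/2026:10:15:00 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.7 - - [03/Jun/2026:11:00:00 +0000] "GET /docs HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.4 - - [09/Jun/2026:09:30:00 +0000] "GET /blog HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (compatible; ChatGPT-User/1.0)"',
    '192.0.2.10 - - [09/Jun/2026:09:31:00 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 Firefox/126.0"',
]
counts = count_bot_hits(sample)
for (week, bot), hits in sorted(counts.items()):
    print(week, bot, hits)
```

To run it on a real file, replace `sample` with something like `open("access.log", encoding="utf-8", errors="replace")`, where the path is whatever your server uses.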

FAQ

What is GPTBot? GPTBot is OpenAI's training crawler. It collects web content used to train future GPT models. It respects robots.txt and is currently the most-blocked AI crawler on the web.

Does blocking GPTBot remove my site from ChatGPT? No. ChatGPT search visibility is controlled by OAI-SearchBot, and live page fetches are done by ChatGPT-User. Blocking GPTBot only opts your future content out of model training.

What is the difference between GPTBot and ChatGPT-User? GPTBot crawls proactively to gather training data. ChatGPT-User fetches a specific page on demand because a user's conversation needs it, which is the visit type that produces citations and referral clicks.

Do AI crawlers respect robots.txt? Most major ones do (OpenAI, Anthropic, Google, Apple, Amazon, Meta, Common Crawl). Documented exceptions and disputes exist, notably around Perplexity-User and Bytespider. For guarantees, block at the server level.

How many AI crawlers exist? Researchers have catalogued more than 220 distinct AI crawler user agents as of 2026. The twenty in our reference table account for the overwhelming majority of real-world AI crawl volume.

Should I use llms.txt or robots.txt for AI bots? Both, for different jobs. robots.txt controls access (allow/block). llms.txt is an emerging standard that tells AI systems what your site is about and which pages matter most, improving how you are represented in AI answers.
