The Complete List of AI Crawlers in 2026 (GPTBot, ClaudeBot, PerplexityBot and More)

The complete, maintained list of AI crawlers in 2026. What GPTBot, ClaudeBot, PerplexityBot and 220+ other bots do on your site, and how to control each one.

9 min read · Datalenk

Last updated: June 2026. We review this list every month and update it whenever a new crawler ships or an existing one changes behavior.

Every day, your website gets visited by bots you never invited. Some of them are building the next GPT. Some are fetching your pages so an AI assistant can quote you in a conversation happening right now. Some take your content and send nothing back.

Researchers have catalogued more than 220 distinct AI crawler user agents. The good news: a dozen of them account for the overwhelming majority of AI crawl traffic. This guide covers every crawler that matters, what each one actually does, and exactly how to allow or block it.

If you only remember one thing from this page, make it this: AI crawlers are not one category. They are three. Treating them as one is the single most common robots.txt mistake we see.

The three types of AI crawlers

Most major AI companies now split their crawling across separate bots, one per purpose. OpenAI started the three-bot pattern and Anthropic follows it; Perplexity runs a two-bot variant (search index and user-fetch).

1. Training crawlers collect content to train future models. Blocking them keeps your future content out of training datasets, but it has no effect on data that was already collected. Examples: GPTBot, ClaudeBot, Bytespider.

2. Search index crawlers build the retrieval indexes behind AI search experiences. If you block these, your site stops appearing (and being cited) in AI search answers. Examples: OAI-SearchBot, Claude-SearchBot.

3. User-fetch agents retrieve a page on demand, because a human just asked the assistant something and the assistant needs your page to answer. These visits can turn into real referral traffic, since the assistant links to you as a source. Examples: ChatGPT-User, Claude-User, Perplexity-User.

The strategic consequence is simple: most sites benefit from allowing types 2 and 3 (visibility, citations, referrals) while making a deliberate choice about type 1 (training). Blocking everything because "AI is stealing content" also removes you from the channel that is currently growing faster than any other: total AI platform visits grew from 15.6 billion in Q1 2025 to 27.4 billion in Q1 2026, an increase of roughly 76% year over year.
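The three-way split is easy to encode as a lookup. Below is a minimal sketch in Python that uses only the example tokens named above; the function name `classify_user_agent` and the "unknown" label are our own illustrative choices, and the matching is a plain case-insensitive substring check rather than anything a vendor specifies.

```python
# Example tokens from the three categories described above.
# The category labels and function name are illustrative assumptions.
AI_BOT_TYPES = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "Bytespider": "training",
    "OAI-SearchBot": "search-index",
    "Claude-SearchBot": "search-index",
    "ChatGPT-User": "user-fetch",
    "Claude-User": "user-fetch",
    "Perplexity-User": "user-fetch",
}


def classify_user_agent(user_agent: str) -> str:
    """Return the crawler type for a User-Agent header, or "unknown"."""
    ua = user_agent.lower()
    for token, bot_type in AI_BOT_TYPES.items():
        if token.lower() in ua:
            return bot_type
    return "unknown"


if __name__ == "__main__":
    # The header below is a made-up example, not a real vendor string.
    print(classify_user_agent("Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"))
```

Anything that declares none of these tokens comes back as "unknown", which covers both ordinary browsers and bots that do not identify themselves.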

The reference table

| User agent | Company | Type | What it does | robots.txt |
|---|---|---|---|---|
| GPTBot | OpenAI | Training | Collects content for model training | Respected |
| OAI-SearchBot | OpenAI | Search index | Indexes pages for ChatGPT search results | Respected |
| ChatGPT-User | OpenAI | User-fetch | Fetches pages live during ChatGPT conversations | Mostly (OpenAI says user-initiated fetches are "not fully governed" by robots.txt) |
| ClaudeBot | Anthropic | Training | Collects content for Claude model training | Respected |
| Claude-SearchBot | Anthropic | Search index | Indexes pages for Claude's search features | Respected |
| Claude-User | Anthropic | User-fetch | Fetches pages when Claude users request them | Respected |
| PerplexityBot | Perplexity | Search index | Indexes pages for cited answers | Respected |
| Perplexity-User | Perplexity | User-fetch | Fetches pages on user request | Disputed (documented compliance incidents in August 2025; Perplexity contests that user-initiated fetches must follow robots.txt) |
| Google-Extended | Google | Training control | Not a separate bot: a robots.txt token that tells Google not to use your content for Gemini training. Crawling still happens via Googlebot | Respected |
| GoogleOther | Google | Misc/R&D | Generic crawler for Google research and product development | Respected |
| Applebot-Extended | Apple | Training control | Like Google-Extended: a token controlling Apple Intelligence training use; crawling happens via Applebot | Respected |
| Bytespider | ByteDance | Training | Collects content for ByteDance models. Crawl volume more than doubled as of May 2026 | Inconsistent reports; treat as unreliable |
| CCBot | Common Crawl | Training (indirect) | Builds the open Common Crawl dataset used by many AI labs | Respected |
| Amazonbot | Amazon | Training + answers | Feeds Alexa and Amazon AI features | Respected |
| Meta-ExternalAgent | Meta | Training | Collects content for Meta AI models | Respected |
| Meta-ExternalFetcher | Meta | User-fetch | On-demand fetching for Meta AI experiences | Respected |
| FacebookBot | Meta | Training/legacy | Older Meta crawler, still active on many sites | Respected |
| DuckAssistBot | DuckDuckGo | Search index | Powers DuckAssist AI answers | Respected |
| cohere-ai | Cohere | Training | Model training collection | Respected |
| Diffbot | Diffbot | Aggregation | Structured data extraction, resold to AI buyers | Respected |

A note on the two "-Extended" entries: webmasters often look for a "Google-Extended bot" in their logs and never find it. That is expected. Google-Extended and Applebot-Extended are opt-out switches, not crawlers. The crawling is done by Googlebot and Applebot, which you almost certainly do not want to block.

OpenAI's crawlers, explained

OpenAI runs the cleanest three-bot separation, and with ChatGPT at roughly 700 million weekly active users and an estimated 87% of all AI platform traffic, its bots are the ones that matter most for your visibility.

  • GPTBot is the training crawler. Blocking it is a values decision more than a traffic decision: it does not remove you from ChatGPT search and does not stop citations. It only keeps future training runs from ingesting your content. GPTBot is currently the most blocked AI crawler on the web (roughly 11% of analyzed sites block it, more than any other AI bot).
  • OAI-SearchBot is what gets you into ChatGPT's search index. If you care about being cited and clicked from ChatGPT answers, this bot must be allowed. Blocking it while allowing GPTBot is the exact opposite of what most site owners actually want.
  • ChatGPT-User shows up when a real person's conversation triggers a fetch of your page. Think of it as a click that has not happened yet: the assistant reads your page, summarizes it, and links to you as a source.

Anthropic's crawlers, explained

Anthropic mirrors the same pattern: ClaudeBot for training, Claude-SearchBot for search indexing, Claude-User for live fetches during Claude conversations. You may also still see legacy tokens like Claude-Web or anthropic-ai in older logs and robots.txt templates; the three bots above are the current canonical set.

Perplexity's crawlers, explained

Perplexity is an answer engine first, so its crawling exists almost entirely to cite sources. PerplexityBot builds the index, Perplexity-User fetches on demand. Worth knowing: Perplexity has publicly argued that user-initiated fetches are not subject to robots.txt, and independent investigations documented compliance incidents in August 2025. If you need a hard guarantee against Perplexity fetches, a robots.txt line is not enough; you will need server-level blocking (user agent filtering or a bot management layer).
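As a rough illustration of what "user agent filtering" means at the server level, here is a minimal sketch of a WSGI middleware in Python. The middleware name and the blocked-token tuple are our own assumptions, and it only catches requests that declare one of these tokens in the User-Agent header; a fetch that sends a different or generic user agent passes straight through, which is why a bot management layer may still be needed.

```python
# Tokens to refuse, lowercased for comparison. Adjust to your own policy.
BLOCKED_TOKENS = ("perplexity-user", "perplexitybot")


def block_declared_agents(app, blocked_tokens=BLOCKED_TOKENS):
    """Wrap a WSGI app and answer 403 to requests whose User-Agent
    header contains one of the blocked tokens (hypothetical helper)."""

    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in ua for token in blocked_tokens):
            start_response(
                "403 Forbidden",
                [("Content-Type", "text/plain; charset=utf-8")],
            )
            return [b"Forbidden\n"]
        # Everything else is handed to the wrapped application unchanged.
        return app(environ, start_response)

    return middleware
```

Wrapping an existing application is one line, `application = block_declared_agents(application)`. The same rule can usually be expressed in a web server or CDN configuration instead, which avoids spending application time on blocked requests.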

Should you block AI crawlers?

There is no universal answer, but there is a rational framework: compare what each bot takes (crawl load, content value) with what it gives back (citations, referral traffic).

That ratio varies enormously by bot type. Search and user-fetch bots send traffic back; training bots do not. This is exactly what the crawl-to-refer ratio measures, and the early data is striking: AI referral traffic to websites grew 357% year over year, and visitors who arrive from AI assistants convert at rates most channels never reach, because they arrive pre-qualified by a recommendation.
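To make the ratio concrete, here is a minimal sketch in Python. It assumes the simplest possible definition, crawler requests divided by referral visits over the same period; the function name and the sample numbers are hypothetical, and a real analytics tool may define or weight the ratio differently.

```python
def crawl_to_refer_ratio(crawl_requests: int, referral_visits: int) -> float:
    """Crawler requests per referral visit for one AI platform.

    Assumed definition: crawl_requests / referral_visits.
    A higher value means the platform takes more than it sends back.
    """
    if referral_visits == 0:
        # No referrals at all: infinite ratio if the bot crawled, else 0.
        return float("inf") if crawl_requests > 0 else 0.0
    return crawl_requests / referral_visits


if __name__ == "__main__":
    # Hypothetical weekly numbers for two platforms.
    print(crawl_to_refer_ratio(1200, 40))  # 30 crawls per referral visit
    print(crawl_to_refer_ratio(900, 0))    # crawls but never refers
```

A platform with a ratio of 30 fetches thirty pages for every visitor it sends; a training-only bot that never refers anyone comes out as infinite.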

Practical guidance by site type:

  • Publishers and content businesses: consider blocking training bots (your content is your asset) while allowing search and user-fetch bots (citations are distribution).
  • SaaS, e-commerce and service businesses: allow everything except the chronically non-compliant. Being recommended by ChatGPT when someone asks "what tool should I use for X" is the new word of mouth, and you want your docs and comparison pages in every index.
  • Sites with paywalled or licensed content: block training and consider licensing deals; allow search bots only for free content sections.

Copy-paste robots.txt templates

Template 1: maximize AI visibility, opt out of training

# Allow AI search and user-fetch bots (citations and referrals)
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /
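Before deploying a template like this, it is worth checking that it does what you intend. The sketch below uses Python's standard `urllib.robotparser` on an abbreviated copy of Template 1; the helper name `check_access` is our own, and Python's parser only approximates how each crawler interprets the file, so treat the result as a sanity check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of Template 1, for illustration only.
TEMPLATE_SUBSET = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""


def check_access(robots_txt, agents, url="/"):
    """Return {agent: allowed?} for the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    for agent, allowed in check_access(
        TEMPLATE_SUBSET, ["OAI-SearchBot", "GPTBot", "Googlebot"]
    ).items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

With this subset, OAI-SearchBot is allowed, GPTBot is blocked, and Googlebot, which has no matching entry, is allowed by default.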

Template 2: allow everything (maximum AI distribution)

No robots.txt entries needed. Every compliant bot crawls by default. Consider adding an llms.txt file instead, to guide what AI systems should read first.

Template 3: block everything AI

Take Template 1, set every entry to Disallow: /, and add any other tokens from the reference table that you want covered (Template 1 does not list all of them). Remember two caveats: already-trained models keep what they learned, and non-compliant bots ignore the file anyway. For real enforcement, filter user agents at the server or CDN level.
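If you would rather generate the block-everything file than edit it by hand, a few lines of Python are enough. This is a minimal sketch: the function name is hypothetical, and the token list is simply the one used in Template 1, so extend it with whatever else you want to block.

```python
# Tokens from Template 1; extend with others from the reference table.
TEMPLATE_1_TOKENS = [
    "OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "GPTBot", "ClaudeBot", "Google-Extended",
    "Applebot-Extended", "Bytespider", "CCBot", "Meta-ExternalAgent",
]


def build_block_all(tokens):
    """Return robots.txt text with one Disallow-all group per token."""
    groups = [f"User-agent: {token}\nDisallow: /" for token in tokens]
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(build_block_all(TEMPLATE_1_TOKENS), end="")
```

The output is only a request to compliant bots; it does nothing against crawlers that ignore robots.txt.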

How to see which AI bots crawl YOUR site

Everything above describes the general picture. Your site has its own reality, and it is probably different from what you expect.

Three ways to find out, from quick to thorough:

  1. Check your robots.txt against the current bot list (the table above changes; your file probably has not).
  2. Grep your server logs for the user agent tokens in the reference table and count requests per bot per week.
  3. Use an analytics tool that separates AI crawl traffic from AI referral traffic and computes the ratio between them, so you can see which AI engines take content and which ones send customers. That is exactly what we built Datalenk for, and our free AI Traffic Checker does the robots.txt and visibility part in ten seconds, no signup required.
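For option 2, a short script can do the grepping and the weekly grouping in one pass. The sketch below is in Python and makes two assumptions you should check against your own setup: that your access log uses the common "combined" format (timestamp in square brackets, user agent as the last quoted field), and that a case-insensitive substring match on the token is good enough. The function name and the sample log line are illustrative.

```python
import re
from collections import Counter
from datetime import datetime

# A subset of the tokens from the reference table; add the rest as needed.
TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
    "Claude-SearchBot", "Claude-User", "PerplexityBot",
    "Perplexity-User", "Bytespider", "CCBot",
]

# Assumes combined log format: [timestamp] ... "user agent" at end of line.
LINE_RE = re.compile(r'\[(?P<ts>[^\]]+)\].*"(?P<ua>[^"]*)"\s*$')


def count_ai_hits(lines):
    """Count requests per (ISO week, bot token) from access log lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        try:
            when = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        except ValueError:
            continue
        year, week, _ = when.isocalendar()
        ua = match.group("ua").lower()
        for token in TOKENS:
            if token.lower() in ua:
                counts[(f"{year}-W{week:02d}", token)] += 1
                break
    return counts


if __name__ == "__main__":
    sample = [
        '203.0.113.7 - - [02/Jun/2026:10:00:00 +0000] "GET / HTTP/1.1" '
        '200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    ]
    for (week, bot), hits in sorted(count_ai_hits(sample).items()):
        print(week, bot, hits)
```

To run it on a real file, pass an open file object, for example `count_ai_hits(open("access.log", encoding="utf-8", errors="replace"))`, where the path is a placeholder for your own log location. Keep in mind that user agent strings are self-declared, so these counts only cover bots that identify themselves.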

FAQ

What is GPTBot? GPTBot is OpenAI's training crawler. It collects web content used to train future GPT models. It respects robots.txt and is currently the most-blocked AI crawler on the web.

Does blocking GPTBot remove my site from ChatGPT? No. ChatGPT search visibility is controlled by OAI-SearchBot, and live page fetches are done by ChatGPT-User. Blocking GPTBot only opts your future content out of model training.

What is the difference between GPTBot and ChatGPT-User? GPTBot crawls proactively to gather training data. ChatGPT-User fetches a specific page on demand because a user's conversation needs it, which is the visit type that produces citations and referral clicks.

Do AI crawlers respect robots.txt? Most major ones do (OpenAI, Anthropic, Google, Apple, Amazon, Meta, Common Crawl). Documented exceptions and disputes exist, notably around Perplexity-User and Bytespider. For guarantees, block at the server level.

How many AI crawlers exist? Researchers have catalogued more than 220 distinct AI crawler user agents as of 2026. The twenty in our reference table account for the overwhelming majority of real-world AI crawl volume.

Should I use llms.txt or robots.txt for AI bots? Both, for different jobs. robots.txt controls access (allow/block). llms.txt is an emerging standard that tells AI systems what your site is about and which pages matter most, improving how you are represented in AI answers.
