Crawlers & Technical Setup · Part 9 of 9

Do AI assistants respect robots.txt? A 12-engine test

When a user pastes a blocked URL into a chatbot, does it obey your robots.txt? A June 2026 test of 12 assistants found that only three honoured the block and declared an honest user-agent. Here's what that means for blocking versus being cited.

4 min read · Updated June 19, 2026

When a user pastes a URL into a chatbot, does the assistant check your robots.txt before fetching it? A June 2026 test by Search Engine World tried exactly that with 12 assistants, and only three (ChatGPT, Claude, and Perplexity) honoured a robots.txt block; the other nine fetched the page anyway. The lesson for site owners is blunt: robots.txt is a request that compliant bots choose to honour, not a wall. It decides which bots politely stay out, not which bots are able to get in.

This is part of the crawlers and technical series, and a companion to the test on whether AI assistants render your JavaScript. Here the question is compliance: does the assistant respect the file that's supposed to govern access?

What did the robots.txt test measure?

On-demand fetches, verified by logs. For each assistant the researchers built two isolated test URLs — one allowed, one disallowed in robots.txt — each carrying a unique "canary" reference string. They pasted the blocked URL into each assistant's chat and asked it to read the page, then checked their own server logs (not the chatbot's self-report) to see whether the fetch happened and whether the page returned its secret string. Every page that was fetched returned its exact canary, which proves a genuine retrieval rather than a hallucinated answer.
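The log check described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' actual tooling: it assumes the server writes the common "combined" access-log format, it reads "returned its canary" as "the canary string appeared in the assistant's answer", and the paths, canary value, and ExampleBot user-agent are all made up for the example.

```python
import re

# Assumed format: ip - - [time] "METHOD path HTTP/x" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<ua>[^"]*)"'
)

def fetches_of(path, log_lines):
    """Return (ip, status, user_agent) for every logged request to `path`."""
    hits = []
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match and match.group("path") == path:
            hits.append((match.group("ip"), int(match.group("status")), match.group("ua")))
    return hits

def canary_verdict(blocked_path, canary, log_lines, assistant_reply):
    """Combine the server's own evidence with what the assistant said."""
    hits = fetches_of(blocked_path, log_lines)
    return {
        "fetched": bool(hits),                         # what the logs show
        "canary_echoed": canary in assistant_reply,    # proof of a real read
        "user_agents": sorted({ua for _, _, ua in hits}),
    }

# Hypothetical sample data, for illustration only.
sample_log = [
    '203.0.113.7 - - [19/Jun/2026:10:00:00 +0000] "GET /blocked-a1 HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '203.0.113.9 - - [19/Jun/2026:10:00:05 +0000] "GET /allowed-a1 HTTP/1.1" 200 498 "-" "ExampleBot/1.0"',
]
result = canary_verdict("/blocked-a1", "canary-7f3e", sample_log,
                        "The page says canary-7f3e.")
```

The point of keeping the two signals separate is that either one alone can mislead: a log hit without the canary might be a prefetch that never reached the model, and a confident answer without a log hit is a hallucination.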

One scope note to read the result honestly: this tests user-initiated fetches — a person explicitly asking the assistant to open a specific URL. That's the ChatGPT-User / Claude-User category of activity, which several vendors classify as a user action rather than autonomous crawling, and the Robots Exclusion Protocol is genuinely ambiguous about whether it should apply. The test doesn't settle that debate; it shows what the assistants do.

Which assistants respected robots.txt — and which ignored it?

Only a quarter complied. The three that did also declared honest, identifiable user-agents; the nine that didn't used a range of disguises.

Behaviour | Assistants | User-agent
--- | --- | ---
Respected the block | ChatGPT, Claude, Perplexity | Honest, declared (ChatGPT-User, Claude-User, Perplexity-User)
Ignored the block | Gemini, Meta AI, Microsoft Copilot, Grok, DeepSeek, Qwen, ERNIE, Kimi | Disguised or undeclared

Among the nine non-compliant systems, the disguises escalated: Gemini fetched as a generic Google client, Meta AI used undeclared crawler variants, Copilot outsourced the fetch to Diffbot, and several (DeepSeek, Qwen, Kimi) presented faked browser identities — including "proxy swarms" across multiple countries with impossible user-agent strings like Windows NT 11.0. All findings are attributed to Search Engine World's June-2026 test; it is a single experiment focused on user-initiated fetches, so treat the specific pass/fail list as a point-in-time snapshot that will shift as vendors change behaviour.
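A faked browser identity like the Windows NT 11.0 example is one of the easier disguises to spot, because no shipping browser sends it (Windows 11 browsers still report Windows NT 10.0). Here is a small sketch of that single check. The list of known versions is my own assumption and is not from the test, so treat it as a starting point rather than a complete catalogue.

```python
import re

# Assumption: Windows NT versions that real browsers have reported.
KNOWN_WINDOWS_NT = {"5.0", "5.1", "5.2", "6.0", "6.1", "6.2", "6.3", "10.0"}
NT_VERSION = re.compile(r"Windows NT (\d+\.\d+)")

def impossible_windows_version(user_agent):
    """Return the claimed NT version if it is not a known one, else None."""
    match = NT_VERSION.search(user_agent)
    if match is None:
        return None  # not claiming to be Windows at all
    version = match.group(1)
    return None if version in KNOWN_WINDOWS_NT else version

suspicious = impossible_windows_version(
    "Mozilla/5.0 (Windows NT 11.0; Win64; x64) AppleWebKit/537.36"
)
ordinary = impossible_windows_version(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
)
```

A check like this only catches careless fakes. A fetcher that copies a plausible browser string passes it untouched, which is why the user-agent is one signal among several.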

Why doesn't robots.txt reliably stop AI fetchers?

Because it was never an enforcement mechanism. The Robots Exclusion Protocol is a published request: a well-behaved AI crawler reads your robots.txt and voluntarily stays out of disallowed paths. Nothing in the protocol forces compliance. A fetcher that decides to ignore it — or that classifies a user-pasted URL as a user action outside the protocol's scope — simply requests the page, and your server returns it.
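To make the "voluntary" part concrete, here is a toy sketch using Python's standard urllib.robotparser. The robots.txt content, the ExampleBot name, the URLs, and the toy_server function are all hypothetical. The only difference between the two fetchers is whether they choose to ask first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for the example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed_by_robots(robots_txt, user_agent, url):
    """Parse robots.txt text and ask whether `user_agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

def toy_server(url):
    """Stand-in for a web server: it returns whatever is requested."""
    return f"contents of {url}"

def polite_fetch(server, robots_txt, user_agent, url):
    """A compliant fetcher: checks the rules and declines on its own."""
    if not allowed_by_robots(robots_txt, user_agent, url):
        return None
    return server(url)

def impolite_fetch(server, url):
    """A non-compliant fetcher: never consults robots.txt."""
    return server(url)

blocked_url = "https://example.com/private/report"
polite_result = polite_fetch(toy_server, ROBOTS_TXT, "ExampleBot", blocked_url)
impolite_result = impolite_fetch(toy_server, blocked_url)
```

Notice that toy_server never looks at robots.txt. In this sketch, as on a real site without extra enforcement, the decision to stay out lives entirely in the client.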

robots.txt tells polite bots where not to go. It does nothing to a bot that doesn't ask politely. If a page must stay private, enforce that at the server, not in a text file the fetcher is free to ignore.

That reframes two common mistakes. First, using robots.txt as a security control: it is the wrong tool, because the bots you most want to keep out are the least likely to honour it. Second, assuming a Disallow line guarantees you won't appear in an answer — it doesn't, if the assistant fetches anyway. To genuinely block traffic you need a CDN or WAF rule that returns a 403 by user-agent, IP, or behaviour — the same layer that, configured wrong, ends up blocking AI crawlers by accident.
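As a rough picture of what server-side enforcement looks like, here is a sketch of a user-agent rule written as Python WSGI middleware. It is a stand-in for a CDN or WAF rule, not a replacement for one: real products have their own rule syntax, and the blocked tokens below are hypothetical. It also only covers the user-agent case, so a fetcher with a disguised browser string would pass straight through and would need an IP or behaviour rule instead.

```python
# Hypothetical tokens; substitute the fetchers you have decided to exclude.
BLOCKED_UA_TOKENS = ("examplebot", "anotherbot")

def block_user_agents(app, blocked_tokens=BLOCKED_UA_TOKENS):
    """Wrap a WSGI app so matching user-agents get a 403 instead of the page."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in blocked_tokens):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain; charset=utf-8"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return app(environ, start_response)
    return middleware

def site(environ, start_response):
    """Toy application standing in for the real site."""
    body = b"Hello, reader\n"
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]

protected_site = block_user_agents(site)

def status_for(app, user_agent):
    """Call a WSGI app with a minimal environ and return the status line."""
    captured = {}
    def start_response(status, headers):
        captured["status"] = status
    b"".join(app({"HTTP_USER_AGENT": user_agent}, start_response))
    return captured["status"]
```

Unlike a Disallow line, this refusal happens whether or not the fetcher asked permission: status_for(protected_site, "ExampleBot/1.0") yields a 403 and an ordinary browser string yields a 200.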

What should I actually do about it?

Decide your stance deliberately, then enforce it where enforcement lives, and verify with logs:

  • If you want AI visibility (most brands): allow them. Don't rely on robots.txt to block, and don't block by accident — confirm the major user-agents in the AI crawler user-agent directory aren't disallowed. The cost of being unreadable is being uncited.
  • If you must protect content: enforce at the CDN/WAF. A robots.txt Disallow is a courtesy, not a lock. Put paywalled or proprietary content behind real access control and return a 403 to fetchers you've decided to exclude.
  • Identify traffic by behaviour, not just the declared name. Because non-compliant fetchers disguise their user-agent, read your raw server logs to see which AI bots actually crawl your site — disguised fetches hide behind ordinary browser strings and won't show up in a name-only filter.
  • Don't trust the chatbot's word. Compliance is what your logs show, not what the assistant claims when asked.
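For the first item on that list, checking that you haven't blocked the compliant fetchers by accident, a quick audit can be sketched with Python's standard urllib.robotparser. The user-agent names are the three from the table above; the sample robots.txt and URLs are hypothetical, and Python's parser may not match every vendor's own matching rules exactly, so treat the output as a first pass.

```python
from urllib.robotparser import RobotFileParser

# The declared user-agents named in the test results above.
AI_USER_AGENTS = ["ChatGPT-User", "Claude-User", "Perplexity-User"]

def blocked_agents(robots_txt, url, agents=AI_USER_AGENTS):
    """Return the agents that this robots.txt tells to stay away from `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]

# Hypothetical robots.txt with one deliberate and one catch-all rule.
SAMPLE_ROBOTS = """\
User-agent: Claude-User
Disallow: /

User-agent: *
Disallow: /admin/
"""

pricing_blocked = blocked_agents(SAMPLE_ROBOTS, "https://example.com/pricing")
admin_blocked = blocked_agents(SAMPLE_ROBOTS, "https://example.com/admin/users")
```

Running this against the pages you most want cited gives a short list to review. Anything that shows up unexpectedly is a candidate for the accidental block the list warns about.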

robots.txt remains worth getting right: it's how you signal intent to the bots that honour it, and how you avoid the far more common own-goal of blocking the crawlers that would have cited you. But treat it as a signal, not a guarantee. Knowing which engines reach your pages, under which user-agents, and whether that turns into citations is exactly what Buffy Intel keeps watch on.

Frequently asked

Do AI assistants obey robots.txt?

Some do, many don't, and it depends on the bot's job. In a June-2026 test by Search Engine World, when a user pasted a URL blocked by robots.txt into the chat, only 3 of 12 assistants honoured the block: ChatGPT, Claude, and Perplexity. The other nine fetched the blocked page anyway. Note this tested user-initiated fetches, which several vendors treat as a user action rather than crawling, a grey area in the standard. The honest summary: robots.txt is a request, not an enforcement mechanism, and you can't assume every assistant will follow it.

If robots.txt doesn't stop them, how do I actually block an AI bot?

At the server or CDN, not in robots.txt. robots.txt is a voluntary directive that compliant bots choose to honour; it has no power to stop a fetcher that ignores it. To genuinely block traffic you need a CDN or WAF rule (by user-agent, IP, or behaviour) that returns a 403. But blocking is rarely the right goal for a brand that wants AI visibility: if engines can't read you, they cite whoever they can. Decide deliberately, then enforce where enforcement actually lives.

Should I block AI assistants from my site?

For most brands that want to be discovered, no. Blocking the bots that ground and cite live answers forfeits citations to competitors the engine can still read. Blocking is defensible for proprietary or paywalled content, but in that case use real enforcement (CDN/WAF), since this test shows robots.txt alone won't reliably keep non-compliant assistants out. The bigger risk for most sites is the opposite: blocking AI crawlers by accident.