When a user pastes a URL into a chatbot, does the assistant check your robots.txt before fetching it? A June-2026 test by Search Engine World ran 12 assistants through exactly that scenario, and only three (ChatGPT, Claude, and Perplexity) honoured a robots.txt block; the other nine fetched the page anyway. The lesson for site owners is blunt: robots.txt is a request that compliant bots choose to honour, not a wall, so it determines who politely stays out, not who is able to get in.
This is part of the crawlers and technical series, and a companion to the test on whether AI assistants render your JavaScript. Here the question is compliance: does the assistant respect the file that's supposed to govern access?
What did the robots.txt test measure?
On-demand fetches, verified by logs. For each assistant the researchers built two isolated test URLs — one allowed, one disallowed in robots.txt — each carrying a unique "canary" reference string. They pasted the blocked URL into each assistant's chat and asked it to read the page, then checked their own server logs (not the chatbot's self-report) to see whether the fetch happened and whether the page returned its secret string. Every page that was fetched returned its exact canary, which proves a genuine retrieval rather than a hallucinated answer.
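The canary-and-log method described above can be sketched roughly as follows. This is an illustration only, not the researchers' tooling: it assumes an access log in the common "combined" format, and the path layout, the `REF-` prefix, and the function names are all hypothetical.

```python
from __future__ import annotations

import re
import secrets

# Combined Log Format: ip - - [time] "METHOD path PROTO" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<ua>[^"]*)"'
)


def make_canary_page(assistant: str, allowed: bool) -> dict:
    """Build one test page: a unique path plus a secret string to embed in its body."""
    token = secrets.token_hex(8)  # 16 hex characters, unguessable in practice
    prefix = "allowed" if allowed else "blocked"
    return {
        "assistant": assistant,
        "path": f"/canary/{prefix}/{token}.html",  # hypothetical URL layout
        "canary": f"REF-{token}",                  # hypothetical canary format
    }


def fetches_of(path: str, log_lines: list[str]) -> list[dict]:
    """Return the parsed log entries that requested exactly `path`."""
    hits = []
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match and match.group("path") == path:
            hits.append(match.groupdict())
    return hits


# Usage: one blocked page, one made-up log line that requests it.
page = make_canary_page("example-assistant", allowed=False)
sample_line = (
    '203.0.113.7 - - [10/Jun/2026:12:00:00 +0000] "GET '
    + page["path"]
    + ' HTTP/1.1" 200 512 "-" "ExampleBot/1.0"'
)
for hit in fetches_of(page["path"], [sample_line]):
    print(hit["ip"], hit["status"], hit["ua"])
```

Because each path is random and published nowhere else, any log entry for it can be tied to the one chat session where it was pasted. The second check is then whether the assistant's answer contains the page's canary string.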
One scope note to read the result honestly: this tests user-initiated fetches — a person explicitly asking the assistant to open a specific URL. That's the ChatGPT-User / Claude-User category of activity, which several vendors classify as a user action rather than autonomous crawling, and the Robots Exclusion Protocol is genuinely ambiguous about whether it should apply. The test doesn't settle that debate; it shows what the assistants do.
Which assistants respected robots.txt — and which ignored it?
Only a quarter complied. The three that did also declared honest, identifiable user-agents; the nine that didn't used a range of disguises.
| Behaviour | Assistants | User-agent |
|---|---|---|
| Respected the block | ChatGPT, Claude, Perplexity | Honest, declared (ChatGPT-User, Claude-User, Perplexity-User) |
| Ignored the block | Gemini, Meta AI, Microsoft Copilot, Grok, DeepSeek, Qwen, ERNIE, Kimi | Disguised or undeclared |
Among the nine non-compliant systems, the disguises escalated: Gemini fetched as a generic Google client, Meta AI used undeclared crawler variants, Copilot outsourced the fetch to Diffbot, and several (DeepSeek, Qwen, Kimi) presented faked browser identities — including "proxy swarms" across multiple countries with impossible user-agent strings like Windows NT 11.0. All findings are attributed to Search Engine World's June-2026 test; it is a single experiment focused on user-initiated fetches, so treat the specific pass/fail list as a point-in-time snapshot that will shift as vendors change behaviour.
Why doesn't robots.txt reliably stop AI fetchers?
Because it was never an enforcement mechanism. The Robots Exclusion Protocol is a published request: a well-behaved AI crawler reads your robots.txt and voluntarily stays out of disallowed paths. Nothing in the protocol forces compliance. A fetcher that decides to ignore it — or that classifies a user-pasted URL as a user action outside the protocol's scope — simply requests the page, and your server returns it.
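To make the "voluntary" part concrete, here is a minimal sketch of the check a compliant fetcher runs on its own side before requesting a page, using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleAI-User` agent name, and the URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one group for a named fetcher, one default group.
ROBOTS_TXT = """\
User-agent: ExampleAI-User
Disallow: /private/

User-agent: *
Disallow: /drafts/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in `robots_txt` allow `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(may_fetch(ROBOTS_TXT, "ExampleAI-User", "https://example.com/private/report.html"))  # False
print(may_fetch(ROBOTS_TXT, "ExampleAI-User", "https://example.com/blog/post.html"))       # True
print(may_fetch(ROBOTS_TXT, "OtherBot", "https://example.com/drafts/idea.html"))           # False
```

The whole check runs in the client. A fetcher that never calls anything like `may_fetch`, or ignores its result, sends the request regardless, and the server has no way to tell from robots.txt alone.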
robots.txt tells polite bots where not to go. It does nothing to a bot that doesn't ask politely. If a page must stay private, enforce that at the server, not in a text file the fetcher is free to ignore.
That reframes two common mistakes. First, using robots.txt as a security control: it is the wrong tool, because the bots you most want to keep out are the least likely to honour it. Second, assuming a Disallow line guarantees you won't appear in an answer — it doesn't, if the assistant fetches anyway. To genuinely block traffic you need a CDN or WAF rule that returns a 403 by user-agent, IP, or behaviour — the same layer that, configured wrong, ends up blocking AI crawlers by accident.
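As an illustration of what server-side enforcement means, the sketch below wraps a WSGI application so that requests from listed user-agents receive a 403 instead of the page. It is a stand-in for a CDN or WAF rule, not a replacement for one; the deny-list entries and function names are hypothetical, and it matches on user-agent only, which a disguised fetcher would slip past.

```python
from wsgiref.util import setup_testing_defaults

# Hypothetical deny-list: lowercase substrings of user-agents you have decided to exclude.
BLOCKED_AGENT_SUBSTRINGS = ("examplebot", "otherfetcher")


def block_by_user_agent(app, blocked=BLOCKED_AGENT_SUBSTRINGS):
    """Wrap a WSGI app so requests from listed user-agents get a 403."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(fragment in agent for fragment in blocked):
            body = b"Forbidden"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return app(environ, start_response)
    return middleware


def site(environ, start_response):
    """Stand-in for the real site: always serves a small page."""
    body = b"protected page"
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def status_for(app, user_agent):
    """Call a WSGI app once and return the status line it produced."""
    environ = {}
    setup_testing_defaults(environ)
    environ["HTTP_USER_AGENT"] = user_agent
    seen = []
    app(environ, lambda status, headers: seen.append(status))
    return seen[0]


guarded = block_by_user_agent(site)
print(status_for(guarded, "ExampleBot/1.0"))  # 403 Forbidden
print(status_for(guarded, "Mozilla/5.0"))     # 200 OK
```

The difference from robots.txt is where the decision is made: here the server refuses, whatever the client chose to read beforehand. Catching fetchers that fake a browser identity needs the IP or behaviour signals mentioned above, which this sketch does not attempt.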
What should I actually do about it?
Decide your stance deliberately, then enforce it where enforcement lives, and verify with logs:
- If you want AI visibility (most brands): allow them. Don't rely on robots.txt to block, and don't block by accident — confirm the major user-agents in the AI crawler user-agent directory aren't disallowed. The cost of being unreadable is being uncited.
- If you must protect content: enforce at the CDN/WAF. A robots.txt Disallow is a courtesy, not a lock. Put paywalled or proprietary content behind real access control and return a 403 to fetchers you've decided to exclude.
- Identify traffic by behaviour, not just the declared name. Because non-compliant fetchers disguise their user-agent, read your raw server logs to see which AI bots actually crawl your site; disguised fetches hide behind ordinary browser strings and won't show up in a name-only filter.
- Don't trust the chatbot's word. Compliance is what your logs show, not what the assistant claims when asked.
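One way to act on the log-reading advice in the bullets above is to label each request by what it did as well as by the name it declared. The sketch below assumes the access log has already been parsed into (path, user-agent) pairs. The labels, the `/private/` prefix, and the sample user-agent strings are hypothetical, and a "review" label marks a candidate for a closer look, not proof of an AI fetch.

```python
import re
from collections import Counter

# Declared names from the table above; extend to suit your own policy.
DECLARED_AI_AGENTS = ("ChatGPT-User", "Claude-User", "Perplexity-User")

# Windows 11 still reports "Windows NT 10.0", so an 11.0 token suggests a forged string.
IMPOSSIBLE_UA = re.compile(r"Windows NT 11\.0")


def classify_hit(path: str, user_agent: str, disallowed_prefixes: tuple) -> str:
    """Label one request by what it did, not only by the name it declared."""
    agent = user_agent.lower()
    if any(name.lower() in agent for name in DECLARED_AI_AGENTS):
        return "declared-ai"
    if IMPOSSIBLE_UA.search(user_agent):
        return "forged-ua"
    if any(path.startswith(prefix) for prefix in disallowed_prefixes):
        return "review-disallowed-path"
    return "ordinary"


def summarise(hits, disallowed_prefixes=("/private/",)) -> Counter:
    """Count labels over (path, user_agent) pairs already parsed from the access log."""
    return Counter(classify_hit(path, ua, disallowed_prefixes) for path, ua in hits)


# Illustrative pairs only; real user-agent strings are longer and vary by vendor.
hits = [
    ("/private/report.html", "Mozilla/5.0 (Windows NT 11.0; Win64; x64)"),
    ("/private/report.html", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"),
    ("/blog/post.html", "Mozilla/5.0 (compatible; ChatGPT-User/1.0)"),
    ("/blog/post.html", "Mozilla/5.0 (Macintosh)"),
]
for label, count in sorted(summarise(hits).items()):
    print(label, count)
```

A name-only filter would surface just the `declared-ai` row. The other labels are where disguised fetches would show up, mixed in with ordinary human visitors, which is why they need a second look, not an automatic block.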
robots.txt remains worth getting right: it's how you signal intent to the bots that honour it, and how you avoid the far more common own-goal of blocking the crawlers that would have cited you. But treat it as a signal, not a guarantee. Knowing which engines reach your pages, under which user-agents, and whether that turns into citations is exactly what Buffy Intel keeps watch on.