March 4, 2026
5 Ways Your Robots.txt Is Blocking AI Crawlers
You spent years perfecting your robots.txt for Google. But the new generation of AI crawlers plays by different rules — and your current setup might be making you invisible to them.
Your Robots.txt Was Written for a Different Era
Nearly every website has a robots.txt. It is one of the first files you set up, and then you mostly forget about it. For twenty years, that was fine. The file told Googlebot and Bingbot what to crawl, and that covered essentially all the discovery that mattered.
But the landscape has changed. A new generation of AI crawlers is visiting your site — GPTBot (OpenAI), ClaudeBot (Anthropic), PerplexityBot, CCBot (Common Crawl), and others. These crawlers feed the language models that increasingly answer questions about your business.
And here is the problem: most robots.txt files were written without these crawlers in mind. The rules you set up for search engines might be actively blocking the AI systems that your potential customers are using right now. Let's look at the five most common ways this happens.
Problem 1: Blanket Disallow Blocking Everything
The most aggressive robots.txt mistake is also the simplest. Some sites include a wildcard disallow rule that blocks all bots by default:
```
User-agent: *
Disallow: /
```
If you have this, every AI crawler is blocked from your entire site. You are completely invisible to language models. When someone asks ChatGPT about your business, the model has nothing from your actual website to work with — it can only rely on third-party mentions, outdated training data, or outright hallucination.
This was sometimes used as a security measure or a lazy way to prevent scraping. But in the AI era, it is the equivalent of removing your business from every phone book, directory, and search engine simultaneously.
The fix: If you want to control access granularly, use specific user-agent rules instead of a blanket wildcard block. Allow the crawlers that matter and block only the paths that need protection.
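You can see the effect of a blanket block for yourself with Python's standard-library robots.txt parser. This is a minimal sketch: the domain is a placeholder, and the point is simply that every AI crawler falls through to the wildcard group and is denied everywhere.

```python
import urllib.robotparser

# The blanket block described above, parsed with the standard library.
BLANKET = """\
User-agent: *
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(BLANKET.splitlines())

# No AI crawler has its own group, so all of them inherit the
# wildcard rule and are blocked from every URL on the site.
for agent in ["GPTBot", "ClaudeBot", "PerplexityBot", "CCBot"]:
    print(agent, "allowed:", rp.can_fetch(agent, "https://example.com/"))
```

Running this prints `allowed: False` for every agent — which is exactly what a language model sees when it tries to read your site.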
Problem 2: Missing AI-Specific User Agents
Many robots.txt files only define rules for Googlebot and maybe Bingbot. Everything else falls through to the wildcard rule — which might or might not be permissive. The AI crawlers that matter most right now include:
- GPTBot — OpenAI's crawler that feeds ChatGPT and its products
- ClaudeBot — Anthropic's crawler for the Claude model family
- PerplexityBot — Perplexity AI's research and answer engine crawler
- CCBot — Common Crawl's crawler, which feeds many open-source language models
- Google-Extended — a robots.txt token (not a separate crawler) that controls whether Google may use your content for Gemini AI training
If your robots.txt doesn't explicitly address these user agents, their access depends on whatever your wildcard rule says. And if your wildcard rule is restrictive (which is common for sites concerned about scraping), you are blocking AI visibility without realizing it.
The fix: Add explicit rules for AI crawlers. Even a simple allow statement ensures they are not caught by restrictive wildcard rules:
```
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /
```
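A quick sketch of why explicit groups matter: in robots.txt, a crawler that matches a specific `User-agent` group obeys only that group and ignores the wildcard entirely. The snippet below (hypothetical file, placeholder domain) demonstrates this with `urllib.robotparser`.

```python
import urllib.robotparser

# A restrictive wildcard plus explicit AI-crawler groups.
ROBOTS = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

# Crawlers with their own group skip the wildcard rules completely.
print(rp.can_fetch("GPTBot", "https://example.com/pricing"))      # True
print(rp.can_fetch("ClaudeBot", "https://example.com/pricing"))   # True

# Everything else still falls through to the restrictive wildcard.
print(rp.can_fetch("RandomScraper", "https://example.com/pricing"))  # False
```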
Problem 3: Crawl-Delay Too Aggressive for AI Bots
Crawl-delay directives tell bots to wait a specified number of seconds between requests. The directive is non-standard — Googlebot ignores it entirely — but many other crawlers honor it, and it was useful for protecting servers from aggressive crawlers that might hammer your site with hundreds of requests per minute.
But AI crawlers typically make far fewer requests than search engine spiders. They are visiting your key pages, not trying to index every URL on your domain. A crawl-delay of 10 or 30 seconds — which might be reasonable for a search engine doing a full site crawl — can make AI crawlers give up before they finish reading your most important content.
Some sites set extremely aggressive crawl-delays specifically to discourage scraping. While the intent is understandable, the side effect is that AI models get an incomplete picture of your business. They might read your homepage but never reach your product pages, your about page, or your pricing.
The fix: Set reasonable crawl-delays for AI crawlers specifically. A delay of 1 to 2 seconds is typically sufficient to protect your server without preventing AI models from reading your key content. Or better yet, rely on rate limiting at the server level instead of robots.txt crawl-delay, which gives you more precise control.
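Per-agent crawl-delays can also be inspected programmatically. A minimal sketch with `urllib.robotparser`, assuming a hypothetical file that keeps a heavy sitewide delay but gives GPTBot a light one:

```python
import urllib.robotparser

# Hypothetical file: 30-second delay for generic bots, 1 second for GPTBot.
ROBOTS = """\
User-agent: *
Crawl-delay: 30

User-agent: GPTBot
Crawl-delay: 1
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

# crawl_delay() returns the delay for the group matching the agent,
# falling back to the wildcard group for everyone else.
print(rp.crawl_delay("GPTBot"))        # 1
print(rp.crawl_delay("SomeOtherBot"))  # 30
```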
Problem 4: Blocking Paths That AI Needs
Many robots.txt files block paths like /api/, /data/, /assets/, or /internal/. For search engines, this makes sense — you don't want Google indexing your API endpoints or raw data files.
But some of these paths contain information that helps AI models understand your business. A common example: blocking /data/ when that directory contains JSON-LD files, product feeds, or structured content that AI models rely on for accurate understanding. Or blocking /api/ when you have public documentation endpoints that describe your product capabilities.
Another common pattern: blocking everything except a small whitelist of pages. This was a valid SEO strategy for preventing thin content from diluting your search rankings. But for AI crawlers, it means the model can only read the few pages you explicitly allowed — and might miss context that's crucial for accurate understanding.
The fix: Review your disallow rules with AI crawlers specifically in mind. Ask yourself: does blocking this path prevent AI from understanding my business accurately? If you have structured data, content APIs, or documentation in blocked paths, consider allowing AI crawlers to access them while keeping search engine restrictions in place.
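One way to implement that split policy is a dedicated group for the AI crawler. The sketch below (hypothetical paths and domain) keeps `/data/` blocked for generic bots while opening it to GPTBot. One caveat worth knowing: `urllib.robotparser` applies rules in the order they are listed (first match wins), which is not the longest-match rule Google uses — so this check is an approximation of real crawler behavior.

```python
import urllib.robotparser

# Hypothetical policy: /data/ stays blocked for generic bots, but
# GPTBot gets its own group with full access.
ROBOTS = """\
User-agent: *
Disallow: /data/

User-agent: GPTBot
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

print(rp.can_fetch("SomeBot", "https://example.com/data/products.json"))  # False
print(rp.can_fetch("GPTBot", "https://example.com/data/products.json"))   # True
```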
Problem 5: No llms.txt Reference
This is not strictly a robots.txt problem, but it is a missed opportunity that lives in the same file. llms.txt is an emerging standard that gives AI models a structured overview of your site — a table of contents for language models.
Your robots.txt is the first file that any well-behaved crawler checks. It is the natural place to reference your llms.txt, just like you reference your sitemap there. If you have an llms.txt but don't reference it in robots.txt, AI crawlers might miss it entirely.
And if you don't have an llms.txt at all, you are leaving AI models to figure out your site structure on their own. That is like publishing a 500-page book without a table of contents and hoping readers find the chapters that matter.
The fix: Create an llms.txt file for your site and reference it in your robots.txt alongside your sitemap. Even a minimal llms.txt with your core pages is better than none.
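There is no official robots.txt directive for llms.txt yet, so a comment line next to your Sitemap entry is one common convention for pointing crawlers at it. The URLs below are placeholders:

```
Sitemap: https://yourdomain.com/sitemap.xml
# llms.txt: https://yourdomain.com/llms.txt
```

And a minimal llms.txt, following the proposed format (an H1 title, a one-line blockquote summary, then sections of annotated links — the pages shown are hypothetical):

```
# Your Company
> One-sentence description of what you do.

## Core pages
- [Pricing](https://yourdomain.com/pricing): Plans and costs
- [About](https://yourdomain.com/about): Who we are and why
```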
How to Fix Your Robots.txt for AI
Here is a practical robots.txt template that balances search engine control with AI accessibility:
```
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /private/

User-agent: GPTBot
Allow: /
Disallow: /admin/

User-agent: ClaudeBot
Allow: /
Disallow: /admin/

User-agent: PerplexityBot
Allow: /
Disallow: /admin/

Sitemap: https://yourdomain.com/sitemap.xml
# llms.txt: https://yourdomain.com/llms.txt
```
The key principles: explicitly allow AI crawlers, block only truly private paths, keep your disallow list minimal, and reference your sitemap and llms.txt.
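Before deploying a new robots.txt, it is worth auditing it against every AI crawler at once. Here is a small self-check sketch using `urllib.robotparser` — the `audit` helper, agent list, and paths are illustrative, and the rules are ordered with disallows first because this parser applies rules in listed order (first match wins) rather than Google's longest-match rule:

```python
import urllib.robotparser

# AI crawlers worth auditing (from the list earlier in the article).
AI_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "CCBot", "Google-Extended"]

def audit(robots_txt, paths=("/", "/pricing", "/admin/secret")):
    """Return {agent: {path: allowed}} for the given robots.txt text."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return {
        agent: {p: rp.can_fetch(agent, "https://example.com" + p) for p in paths}
        for agent in AI_AGENTS
    }

# Disallow lines come before Allow: / so this parser's
# first-match ordering produces the intended result.
TEMPLATE = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /

User-agent: GPTBot
Disallow: /admin/
Allow: /
"""

for agent, results in audit(TEMPLATE).items():
    print(agent, results)
```

For each agent this prints which paths are reachable — public pages open to every crawler, `/admin/` blocked across the board.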
Of course, every site is different. The best way to find out exactly what your robots.txt is doing to your AI visibility is to run a diagnostic scan. aystos Cockpit checks your robots.txt against every major AI crawler, identifies blocking rules that harm your AI perception, and tells you exactly what to change.
The scan takes 60 seconds, costs nothing for the foundation audit, and might reveal that you have been invisible to AI for months without knowing it. Try it now.
Is Your Robots.txt Blocking AI?
Scan your site and find out exactly which AI crawlers can — and cannot — reach your content.
Scan Your Site Free →