Resources
AI Perception Glossary
The language of the AI-mediated web is still being written. This glossary defines the concepts that matter most — from how AI crawlers discover your content to how large language models decide what to say about your business.
#
5W1H Analysis
When a journalist writes a story, they answer six fundamental questions: Who, What, When, Where, Why, and How. AI models do the same thing when they read your website — except they do it automatically, and they don't always get it right.
5W1H analysis is the framework aystos uses to evaluate how well AI understands your content along these six dimensions. For example: Does the model correctly identify who your company is? Does it understand what services you offer, where you operate, and how you differ from competitors? Gaps in any of these dimensions can lead to AI giving incomplete or incorrect answers about your business.
The aystos Cockpit runs 5W1H analysis as part of every deep audit, comparing what AI models extract from your pages against what your content actually says. When there's a mismatch — say, AI thinks you're headquartered in Berlin when you're actually in Munich — that shows up as an alignment issue with a clear fix path.
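The kind of mismatch described above can be sketched as a simple comparison. All data, field names, and the Berlin/Munich values below are illustrative, not aystos's actual pipeline:

```python
# Illustrative 5W1H gap check: compare AI-extracted answers against
# the facts the page actually states. All values are hypothetical.
EXTRACTED = {
    "who":   "Acme Legal GmbH",
    "what":  "patent law",   # the model's (wrong) guess
    "where": "Berlin",       # the model's (wrong) guess
}
GROUND_TRUTH = {
    "who":   "Acme Legal GmbH",
    "what":  "employment law",
    "where": "Munich",
    "when":  "founded 2012",
}

def find_gaps(extracted, truth):
    """Return the dimensions where AI's answer is missing or wrong."""
    gaps = {}
    for dim, fact in truth.items():
        if extracted.get(dim) != fact:
            gaps[dim] = {"expected": fact, "got": extracted.get(dim)}
    return gaps

print(find_gaps(EXTRACTED, GROUND_TRUTH))
```

The real analysis works on free-text model output rather than tidy dictionaries, but the principle is the same: every dimension where extraction and ground truth diverge becomes a concrete issue to fix.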
A
AI Crawlers
AI crawlers are automated bots that visit websites to collect content for training or powering AI models. They work similarly to traditional search engine crawlers like Googlebot, but their purpose is different: instead of building a search index, they're feeding content into large language models that generate conversational answers.
The most well-known AI crawlers include GPTBot (OpenAI), ClaudeBot (Anthropic), PerplexityBot (Perplexity AI), and CCBot (Common Crawl). Each has its own user-agent string and respects (or should respect) your robots.txt directives. The challenge is that many website owners don't realize these bots exist — or that blocking them means AI models can't accurately represent their business.
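As an illustration, a robots.txt that explicitly grants these four crawlers access might look like this. The allow-everything policy is an example, not a recommendation; tailor the paths to your own content strategy:

```text
# Explicitly allow the major AI crawlers (illustrative policy)
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: CCBot
Allow: /
```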
aystos checks whether your site is accessible to all major AI crawlers as part of the crawlability score in every Cockpit audit. If you're accidentally blocking GPTBot but allowing ClaudeBot, you'll know — and you'll get specific guidance on how to fix it.
AI Perception Score
The AI Perception Score is aystos's composite metric that tells you, on a scale of 0 to 100, how well AI models understand your website. Think of it as a health check for your AI visibility — not just whether AI can find your content, but whether it actually understands what your business does.
The score is calculated across four weighted dimensions: Crawlability (20%) measures whether AI bots can access your pages. Machine Readability (25%) checks your structured data and metadata. AI Interpretation (35%) uses 5W1H analysis to evaluate what models actually understand when they read your content. Alignment (20%) compares AI's interpretation against reality to catch hallucination risks.
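The weighting described above amounts to a simple weighted average. A sketch of the arithmetic, with the weights taken from the text and the sub-scores invented for illustration:

```python
# Composite AI Perception Score as a weighted average of the four
# dimensions. Weights are from the glossary text; sub-scores are made up.
WEIGHTS = {
    "crawlability":        0.20,
    "machine_readability": 0.25,
    "ai_interpretation":   0.35,
    "alignment":           0.20,
}

def perception_score(subscores: dict) -> float:
    """Weighted 0-100 composite of the four dimension scores."""
    return sum(WEIGHTS[dim] * subscores[dim] for dim in WEIGHTS)

example = {
    "crawlability": 90,          # AI bots can reach most pages
    "machine_readability": 60,   # structured data is patchy
    "ai_interpretation": 70,     # models mostly understand the content
    "alignment": 80,             # few hallucination risks
}
print(perception_score(example))  # 0.20*90 + 0.25*60 + 0.35*70 + 0.20*80
```

Note how the 35% weight on AI Interpretation means understanding gaps pull the composite down harder than any technical factor.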
You can get your AI Perception Score for free with a Cockpit foundation audit — it takes about 60 seconds and doesn't require a signup. The deep analysis adds multi-model interpretation, a narrative briefing, and page-level issue detection for sites that need a thorough review.
Alignment Score
The Alignment Score measures how well AI's understanding of your website matches what your content actually says. It's one of the four dimensions of the AI Perception Score, weighted at 20% of the total.
Here's why it matters: AI models don't just repeat what's on your website — they interpret, summarize, and sometimes extrapolate. That interpretation can drift from reality. A model might correctly identify that you're a law firm but incorrectly state that you specialize in patent law when you actually focus on employment law. The facts are partially right, which makes the error harder to catch and more dangerous to your business.
The aystos Cockpit calculates alignment by comparing AI-extracted claims against your actual published content. Misalignments surface as specific, fixable issues — not vague warnings. If AI says you're in the wrong city or lists services you don't offer, you'll see exactly where the confusion originates and what content changes will correct it.
B
BYOK (Bring Your Own Key)
BYOK stands for Bring Your Own Key — a deployment model where you use your own API keys from AI providers like OpenAI, Anthropic, or Google instead of routing requests through a shared service. It's a common requirement for organizations that need full control over data governance, costs, and provider relationships.
In the context of aystos, BYOK means that enterprise customers can configure their own API credentials for every AI operation — audits, content generation, search synthesis. Your content goes directly to your AI provider under your account, your rate limits, and your data processing agreements. No intermediary sees the data.
BYOK customers bypass aystos credit deduction entirely. You pay your AI providers at their published rates, and aystos handles the orchestration, analysis pipeline, and reporting. It's the right choice for regulated industries, large enterprises with existing AI contracts, or anyone who wants complete transparency over where their data goes.
C
CCBot
CCBot is the web crawler operated by Common Crawl, a nonprofit organization that maintains one of the largest open web archives in the world. Unlike proprietary crawlers from AI companies, CCBot collects data that's made freely available — which means its archive has been used to train a wide range of AI models, including many you interact with daily.
Common Crawl data forms part of the training datasets for models from multiple AI providers. That makes CCBot one of the most consequential crawlers for how AI understands your website: blocking it doesn't just affect one model — it can affect many. On the other hand, allowing it means your content contributes to the broad AI training ecosystem.
aystos checks CCBot access as part of the crawlability score in every Cockpit audit. If your robots.txt blocks CCBot, you'll see a specific issue flagged with guidance on the trade-offs of allowing or restricting access. The decision is yours — but it should be an informed one.
ClaudeBot
ClaudeBot is the web crawler operated by Anthropic to fetch content for its Claude family of AI models. When someone asks Claude a question that involves information from the web, ClaudeBot is how that content gets collected and indexed.
ClaudeBot identifies itself with the user-agent string ClaudeBot and respects standard robots.txt directives. Anthropic publishes documentation about its crawling practices and offers a way for site owners to control access. Like other AI crawlers, blocking ClaudeBot means Claude won't have up-to-date information about your website — which can lead to outdated or hallucinated responses when users ask about your business.
The aystos Cockpit checks ClaudeBot access alongside other major AI crawlers. If your site blocks ClaudeBot but allows GPTBot, for instance, you'll see that discrepancy flagged — because inconsistent crawler policies mean different AI models will have different (and potentially contradictory) information about you.
Content Semantic Alignment
Content Semantic Alignment describes the degree to which your published content matches what AI models actually understand and communicate about your business. It goes beyond simple keyword matching — it's about whether the meaning of your content is preserved when AI processes and summarizes it.
Imagine you're a sustainable fashion brand. Your website clearly states you use organic cotton and ethical supply chains. High semantic alignment means AI models accurately convey this positioning when users ask about sustainable clothing brands. Low alignment means AI might describe you as just another fashion retailer — or worse, associate you with practices you've deliberately avoided.
Semantic alignment is influenced by how clearly your content communicates key concepts, whether your structured data reinforces your messaging, and whether conflicting signals on different pages confuse AI interpretation. The aystos Cockpit measures this through multi-model 5W1H analysis and surfaces misalignments as specific, actionable issues.
Crawlability Score
The Crawlability Score measures whether AI models can actually access your website's content. It's the first dimension of the AI Perception Score, weighted at 20% of the total — because if AI can't read your pages, nothing else matters.
This score evaluates several technical factors: Are AI crawlers allowed in your robots.txt? Does your sitemap cover your important pages? Is SSL properly configured? Are canonical URLs set correctly? Does your site have an llms.txt file that helps AI understand your content hierarchy? A surprising number of websites accidentally block major AI crawlers — which means ChatGPT, Claude, and Perplexity literally cannot read their pages.
The aystos Cockpit checks all these factors in every foundation audit — free, no signup required. If you're blocking GPTBot without realizing it, or your sitemap is missing half your pages, you'll know in about 60 seconds. The aystos Client can fix many crawlability issues automatically from inside your CMS.
E
Entity Extraction
Entity extraction is the process by which AI identifies and categorizes the people, places, organizations, products, and concepts mentioned in your content. When a large language model reads your website, it doesn't just see words — it tries to understand what those words refer to in the real world.
For example, if your page mentions "Dr. Sarah Chen, Chief Medical Officer at MedTech Solutions in Boston," entity extraction identifies a person (Dr. Sarah Chen), a role (Chief Medical Officer), an organization (MedTech Solutions), and a location (Boston). The accuracy of this extraction directly affects how AI answers questions about your business — get entity extraction wrong, and AI might attribute your CEO's quotes to a competitor or confuse your headquarters location.
The aystos Cockpit evaluates entity extraction as part of the AI Interpretation dimension. It checks whether models correctly identify your key entities — your company name, leadership, locations, products, and industry. Clear structured data via JSON-LD and unambiguous content significantly improve entity extraction accuracy. The aystos Client auto-generates Organization, Person, and Product schemas to help AI get your entities right.
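To make the mechanics concrete, here is a toy rule-based extractor run over the example sentence above. Real entity extraction uses trained NER models; these regex patterns are purely illustrative and would not generalize beyond this one sentence:

```python
import re

# Toy entity extraction over the sentence from the example above.
# The patterns are hand-written for this exact sentence.
TEXT = "Dr. Sarah Chen, Chief Medical Officer at MedTech Solutions in Boston"

PATTERNS = {
    "person":       r"Dr\.\s[A-Z][a-z]+\s[A-Z][a-z]+",
    "role":         r"Chief\s[A-Z][a-z]+\sOfficer",
    "organization": r"at\s([A-Z]\w+\s[A-Z]\w+)",
    "location":     r"in\s([A-Z]\w+)$",
}

def extract_entities(text):
    """Map entity labels to the first matching span for each pattern."""
    entities = {}
    for label, pattern in PATTERNS.items():
        m = re.search(pattern, text)
        if m:
            # Use the capture group if the pattern has one, else the whole match.
            entities[label] = m.group(m.lastindex or 0)
    return entities

print(extract_entities(TEXT))
```

A trained model does the same labeling statistically across arbitrary text, which is exactly why ambiguous phrasing or missing structured data degrades its accuracy.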
G
GPTBot
GPTBot is OpenAI's web crawler, used to fetch content that powers ChatGPT and the broader GPT model family. Given that ChatGPT is one of the most widely used AI assistants in the world, GPTBot's access to your website directly influences what millions of people hear about your business every day.
GPTBot identifies itself with the user-agent string GPTBot and respects robots.txt directives. OpenAI provides documentation on how to allow or block GPTBot, including options for allowing access to some paths while restricting others. Blocking GPTBot entirely means ChatGPT will rely on older training data or third-party sources when answering questions about you — neither of which you control.
The aystos Cockpit checks GPTBot access in every audit. If your robots.txt blocks GPTBot — intentionally or accidentally — you'll see it flagged with a clear explanation of what it means for your AI visibility. For most businesses, ensuring GPTBot can access your key pages is one of the highest-impact, lowest-effort improvements you can make.
H
Hallucination Risk
In AI, a hallucination is when a model generates information that sounds confident and plausible but is factually wrong. Hallucination risk measures how likely it is that AI will invent facts about your business — wrong locations, nonexistent products, fabricated reviews, or misattributed expertise.
Hallucination risk increases when AI has insufficient or ambiguous information to work with. If your website lacks clear structured data, blocks AI crawlers, or presents contradictory information across pages, models are more likely to fill in the gaps with educated guesses. Those guesses might be close to reality — or they might be completely wrong. Either way, they're presented with the same confidence as verified facts.
The aystos Cockpit specifically analyzes hallucination risk as part of the Alignment Score. It identifies claims AI makes about your business that don't match your published content — and traces them back to the content gaps or ambiguities that caused the hallucination. Fixing these root causes through clearer content and better structured data is the most effective way to reduce hallucination risk.
J
JSON-LD
JSON-LD (JavaScript Object Notation for Linked Data) is a format for embedding structured data in your web pages. It lets you describe your content in a way that machines — including AI models — can understand without having to parse and interpret your HTML.
A JSON-LD block is a <script> tag (typically placed in your page's head, though it can appear anywhere in the document) that contains structured information about your business, articles, products, events, or any other entity your page describes. It uses the Schema.org vocabulary, so there's a standardized way to say "this is an Organization, its name is X, it's located in Y, and it offers service Z." AI models use this data as high-confidence signals because it's explicitly structured, not inferred from free text.
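A minimal Organization block following that pattern might look like this. Every value below is a placeholder:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Example GmbH",
  "url": "https://www.example.com",
  "address": {
    "@type": "PostalAddress",
    "addressLocality": "Munich",
    "addressCountry": "DE"
  },
  "description": "Placeholder description of what the business offers."
}
</script>
```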
JSON-LD is the single highest-impact technical improvement for AI readability. The aystos Client auto-generates JSON-LD for every page on your site — Organization, WebSite, WebPage, Article, BreadcrumbList — directly from your CMS data. No manual markup, no per-page effort. Run a free Cockpit scan to see what structured data your site is currently missing.
L
LLM Discovery
LLM Discovery refers to how large language models find, access, and index your website's content. Unlike traditional search engine indexing, which is well-documented and supported by decades of SEO practice, LLM discovery is a newer challenge — and the rules are still being established.
Discovery happens through multiple channels: AI crawlers that directly visit your site, training data from web archives like Common Crawl, and retrieval-augmented generation systems that fetch content in real time when answering questions. Each channel has different access mechanisms, different update frequencies, and different implications for how current the AI's knowledge of your business is.
aystos helps you optimize for all discovery channels. The Cockpit checks crawler access, structured data quality, and llms.txt presence. The Client manages these technical signals from inside your CMS. The goal is simple: make sure that when someone asks AI about your business, the model has accurate, current information to work with.
llms.txt
llms.txt is a proposed standard for a Markdown-based file that lives at the root of your website (like robots.txt) and serves as a manifest for AI models. While robots.txt tells crawlers what they can access, llms.txt tells AI what your site is about and how to navigate it efficiently.
The format is intentionally simple: a title, a one-line description, an extended summary, and a structured list of links to your most important content — organized by topic. It's designed to fit within an LLM's context window, giving the model a high-density overview of your site without requiring it to crawl and parse dozens of HTML pages.
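A skeletal llms.txt following that structure might look like this. All names, URLs, and descriptions are placeholders:

```markdown
# Example GmbH

> One-line description of what the site offers and for whom.

A short extended summary giving AI models a high-density overview of
the business, its services, and its audience.

## Services

- [Consulting](https://www.example.com/consulting): What the service covers
- [Training](https://www.example.com/training): Who it's for and what it includes

## Company

- [About](https://www.example.com/about): Team, history, and locations
```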
aystos was an early adopter of the llms.txt standard. The Cockpit checks for llms.txt presence and quality in every audit. The aystos Client includes a built-in llms.txt editor that lets you manage the file directly from your CMS admin panel. For a detailed implementation guide, see our llms.txt Guide.
M
Machine Readability Score
The Machine Readability Score measures how well machines can parse and understand the structured information on your website. It's the second dimension of the AI Perception Score, weighted at 25% of the total.
This score evaluates your JSON-LD structured data, Open Graph metadata, semantic HTML structure, hreflang tags for multilingual sites, and other technical signals that help machines interpret your content without relying on AI inference. The distinction matters: structured data gives machines explicit facts about your content, while unstructured HTML requires AI to infer those facts — and inference is where hallucinations happen.
Most websites score poorly on machine readability because they were built for human readers and search engine crawlers, not for AI interpretation. The aystos Cockpit identifies exactly what's missing — whether it's a missing Organization schema, incomplete Open Graph tags, or malformed JSON-LD. The aystos Client can auto-inject many of these signals directly from your CMS data, dramatically improving your score with minimal effort.
P
PerplexityBot
PerplexityBot is the web crawler operated by Perplexity AI, a search engine that combines traditional web retrieval with AI-generated answers. Perplexity has gained significant traction as an alternative to Google for question-based searches, making PerplexityBot increasingly important for businesses that want to appear in AI-generated search results.
What makes Perplexity particularly interesting is its retrieval-augmented generation (RAG) architecture: when a user asks a question, Perplexity retrieves relevant web pages in real time and synthesizes an answer with source citations. That means PerplexityBot's access to your site directly affects whether your content appears as a cited source in Perplexity's answers — not just whether the AI knows about you, but whether it actively recommends you.
The aystos Cockpit checks PerplexityBot access alongside other AI crawlers. Because Perplexity uses real-time retrieval, keeping your content accessible and well-structured has an even more immediate impact here than with models that rely primarily on training data. Ensuring PerplexityBot can read your pages is one of the most direct paths to appearing in AI-generated search results.
R
RAG (Retrieval-Augmented Generation)
RAG stands for Retrieval-Augmented Generation — an AI architecture where the model doesn't just rely on its training data to answer questions. Instead, it first retrieves relevant documents from an external source, then generates an answer based on that retrieved context. It's how AI systems like Perplexity, Bing Chat, and aystos Search can give answers grounded in current, specific information.
The "retrieval" step is what makes RAG powerful for website owners. Rather than hoping your content was included in a model's training data (which might be months or years old), a RAG system fetches your actual pages in real time. If your content is well-structured, accessible, and clearly written, it's more likely to be retrieved — and more likely to be cited in the AI's answer.
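The retrieve-then-generate flow can be sketched in a few lines. This toy version retrieves by word overlap instead of embeddings and stops at building the grounded prompt; the documents, URLs, and scoring are all invented for illustration:

```python
# Toy RAG pipeline: retrieve the most relevant document by keyword
# overlap, then build a grounded prompt for a language model.
# Real systems use embeddings and an actual LLM call.
DOCS = {
    "https://example.com/pricing":  "Plans start at 29 euros per month.",
    "https://example.com/about":    "Founded in Munich, we build search tools.",
    "https://example.com/security": "All data is encrypted at rest.",
}

def retrieve(query):
    """Return the (url, text) pair with the largest word overlap with the query."""
    q = set(query.lower().split())
    return max(DOCS.items(), key=lambda kv: len(q & set(kv[1].lower().split())))

def build_prompt(query):
    url, text = retrieve(query)
    # The retrieved passage grounds the answer; the URL becomes the citation.
    return f"Answer using only this source ({url}):\n{text}\n\nQuestion: {query}"

print(build_prompt("how much do plans cost per month"))
```

The key property survives even in this toy: the answer is grounded in a freshly fetched document and carries a citation back to its source, rather than relying on whatever the model memorized during training.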
aystos Search is built on a RAG architecture. It indexes your site's content, retrieves relevant sections when visitors ask questions, and generates AI answers with citations linking back to your pages. The Cockpit helps you optimize your content for RAG retrieval by improving crawlability, structured data, and content clarity.
Robots.txt (AI Context)
You probably already know robots.txt as the file that tells search engine crawlers which parts of your site they can access. But robots.txt has taken on new significance in the AI era: it's now the primary mechanism for controlling which AI crawlers can read your content.
Here's the problem: most robots.txt files were written years ago with only Googlebot and Bingbot in mind. They might not mention GPTBot, ClaudeBot, PerplexityBot, or CCBot at all — which usually means these crawlers have access by default. Conversely, some overly restrictive configurations accidentally block all AI crawlers, making the site invisible to every major AI model.
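Python's standard library can demonstrate that default-rule behavior. This sketch parses a hypothetical legacy robots.txt and shows an unlisted AI crawler falling through to an overly broad wildcard rule:

```python
from urllib.robotparser import RobotFileParser

# A robots.txt written before AI crawlers existed: it only mentions
# Googlebot, so every other bot falls through to the wildcard rule.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Googlebot keeps its access, but GPTBot was never mentioned, so the
# blanket "Disallow: /" silently blocks it from the entire site.
print(rp.can_fetch("Googlebot", "https://example.com/about"))  # True
print(rp.can_fetch("GPTBot", "https://example.com/about"))     # False
```

Flip the example around (a permissive `User-agent: *` block) and the opposite happens: unlisted AI crawlers get full access by default, which is the more common case on older sites.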
The aystos Cockpit analyzes your robots.txt specifically for AI crawler implications. It checks access rules for every major AI bot, identifies inconsistencies (like allowing one AI crawler but blocking others), and provides specific recommendations. The aystos Client includes a remote robots.txt editor so you can make changes without touching server files or waiting on a developer.
S
Schema.org
Schema.org is a collaborative vocabulary standard — originally created by Google, Microsoft, Yahoo, and Yandex — that provides a shared language for describing things on the web. It defines types like Organization, Product, Article, Event, and hundreds more, each with specific properties that describe their attributes.
When you add Schema.org markup to your website (typically via JSON-LD), you're giving machines explicit, unambiguous information about your content. Instead of hoping that AI correctly infers your business type from your "About" page, you're stating it directly: "This is an Organization. Its name is X. It offers these services. It's located here." That explicitness dramatically reduces hallucination risk and improves alignment.
Schema.org remains the most widely supported structured data vocabulary on the web — used by search engines, AI models, voice assistants, and knowledge graphs alike. The aystos Client auto-generates Schema.org markup for your entire site using data from your CMS. The Cockpit audits your existing Schema.org implementation and identifies gaps, errors, and opportunities to add more specific type information.
Semantic Search
Semantic search is search that understands meaning rather than just matching keywords. When someone searches for "how do I fix a leaky faucet," a keyword-based search looks for pages containing those exact words. A semantic search understands the intent — the user wants plumbing repair instructions — and can match pages titled "Bathroom Plumbing Repair Guide" even if they never use the word "faucet."
Semantic search works by converting both the query and your content into mathematical representations (called embeddings) that capture meaning. Similar meanings produce similar embeddings, so the search can find relevant content based on conceptual similarity rather than word overlap. This is the same technology that powers how AI models understand language — applied to your website's search experience.
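A toy sketch of the idea, using hand-made three-dimensional vectors in place of real model embeddings. The vectors and titles are invented, and real embeddings have hundreds or thousands of dimensions; the point is that the faucet query and the plumbing page land close together despite sharing no words:

```python
import math

# Hand-made stand-ins for model-produced embeddings.
EMBEDDINGS = {
    "how do I fix a leaky faucet":    [0.9, 0.1, 0.0],
    "Bathroom Plumbing Repair Guide": [0.8, 0.2, 0.1],
    "Chocolate Cake Recipe":          [0.0, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, near 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

query = EMBEDDINGS["how do I fix a leaky faucet"]
for title in ("Bathroom Plumbing Repair Guide", "Chocolate Cake Recipe"):
    print(title, round(cosine(query, EMBEDDINGS[title]), 3))
```

Keyword search would score both pages zero against this query; semantic search ranks the plumbing guide far above the recipe because their vectors point in nearly the same direction.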
aystos Search uses a hybrid approach: semantic vector search for understanding meaning, combined with traditional full-text search for catching exact terms and product names. Results from both are fused and re-ranked across 15+ signals. The result is a search experience that understands your visitors' questions the way a knowledgeable human would — but at machine scale.
Structured Data
Structured data is machine-readable metadata embedded in your web pages that explicitly describes what your content is about. Instead of relying on AI to figure out that your page is about a product with a specific price, availability, and manufacturer, structured data states these facts directly in a format machines can parse without interpretation.
The most common format for structured data on the web is JSON-LD using the Schema.org vocabulary. Other formats exist (Microdata, RDFa), but JSON-LD has become the standard recommended by Google and increasingly important for AI readability. Structured data powers rich search results, knowledge panels, voice assistant answers, and — critically — helps AI models make accurate claims about your business.
Structured data is one of the highest-leverage improvements for AI visibility. Sites with comprehensive, accurate structured data score significantly higher on the Machine Readability Score and experience fewer hallucination issues. The aystos Client auto-generates structured data for every page on your site from your CMS content. The Cockpit audits what you have, identifies what's missing, and shows you exactly how it affects your AI Perception Score.
See How AI-Ready Your Website Is
Run a free Cockpit scan and find out what AI models actually understand about your business.
Start Free Cockpit Scan →