ResilientNiche

Does ChatGPT Use My Website? How AI Engines Find and Cite Sites

Two paths get your site into AI answers: live retrieval and training data. You control one. Here is how ChatGPT, Claude, Perplexity, and Copilot actually find you.

Malik Browne

Built BakingSubs to 162,500 Copilot citations and accelerating. Now teaching the system behind it.

  • ai-visibility-general
  • chatgpt
  • strategy

Yes, ChatGPT can use your website, but probably not in the way you think. There are two separate paths a site takes into an AI answer, and only one of them is something you can actually influence. If you understand the difference, you stop worrying about the wrong thing and start fixing the right thing.

Key takeaways

  • AI engines reach your site through two paths: live retrieval (the engine browses the web in real time) and training data (snapshots of the web baked into the model months or years ago).
  • You cannot edit training data. You can absolutely influence whether the live-retrieval path finds and quotes you.
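  • If your robots.txt blocks the crawlers these engines rely on (GPTBot, ClaudeBot, PerplexityBot, and the search and browsing agents their vendors run alongside them), you are invisible to those engines on the live path no matter how good your content is.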
  • ChatGPT, Claude, Perplexity, and Copilot all behave differently. Copilot in particular leans heavily on live retrieval, which is why BakingSubs has earned 162,500 Copilot citations, with 112,500 of those in just the last three months.
  • The fix is not "rank for AI." It is making your pages crawlable, clearly attributed to a real author, and easy for an engine to quote in one or two sentences.

The two ways your website ends up in an AI answer

Your site can enter an AI response through training data or through live retrieval, and the practical difference matters more than the technical one. Training data is a frozen snapshot of the web that the model learned from during its build. Live retrieval is what happens when the engine browses the web while answering you, pulls fresh pages, and quotes from them in real time.

If your site existed and was crawlable before the training cutoff, fragments of it may sit inside the model's weights. You cannot see them, edit them, or remove them. You also cannot reliably get the model to cite them, because training data shows up as paraphrased general knowledge, not as a named source.

Live retrieval is the path that actually creates citations. When someone asks ChatGPT, Claude, Perplexity, or Copilot a question and the engine decides to browse, it sends a crawler out, reads pages, and pulls quotes with links back. That is the path where you get named. That is the path where buyers click through to your site. And that is the path you can influence.

The fastest way to find out whether any engine is quoting you today is to run a structured check. You can do that with the AI Visibility Check or by hand using the free 60-second test.

What each engine actually does differently

Each of the four engines uses a different mix of training data and live retrieval, and that mix changes what you should focus on. The behavior is not identical, even though the surface output looks similar.

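ChatGPT uses both paths. When you ask it a general question, it often answers from training data with no citations. When you ask a current question, or one where it senses freshness matters, it browses and cites sources. OpenAI documents separate crawlers for these jobs: GPTBot collects pages for training, OAI-SearchBot indexes pages for ChatGPT's search results, and ChatGPT-User fetches pages when a user's request triggers browsing. If your site blocks the search and browsing agents, you will not appear in those browsed answers.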

Claude leans on retrieval when it has access to the web, and it tends to weight pages that have clear author attribution and a focused topic. Anthropic's crawler is ClaudeBot. Claude is also more cautious about quoting pages where it cannot tell who wrote them or whether the source is credible, so generic stock-photo "about us" pages tend to lose to pages that carry a named author and Person schema.

Perplexity is the most retrieval-heavy of the four. It is built around live search with sources displayed inline, and its crawler is PerplexityBot. If you want to understand the engine that puts sources front and center in every answer, start with how Perplexity recommends.

Microsoft Copilot also runs heavily on live retrieval, and it tends to surface niche, topically focused sites that other engines miss. This is the engine where BakingSubs has earned 162,500 citations to date, with 112,500 of those (about 69 percent) landing in just the last three months. The acceleration is the story. Copilot is rewarding sites that match a specific question better than the big generic players do.

Why blocking AI crawlers makes you invisible (and why most sites do it by accident)

Your robots.txt file tells crawlers which parts of your site they can visit. If yours blocks the agents these engines use (GPTBot, ClaudeBot, PerplexityBot, and the companion search and user-fetch agents each vendor documents), those engines cannot read your pages on the live-retrieval path, and they will not cite you. Period.

Here is the part that surprises people: a lot of sites block these bots without meaning to. A WordPress security plugin flips a switch. A site migration copies over an old robots.txt from 2022, written before these crawlers existed. A developer adds a broad "disallow all bots except Google" rule. Suddenly the site is invisible to AI search and the owner has no idea why.

Imagine a small SaaS in Austin that sells inventory software for independent coffee roasters. The founder, Linnea, wrote 30 thoughtful pages about roast curves, green bean sourcing, and small-batch margin math. None of it shows up in Claude or Perplexity. She thinks her content is bad. The actual problem is one line in her robots.txt left over from a 2023 spam migration that disallows every bot except Googlebot. Once she removes the line, ClaudeBot starts crawling within days, and inside a few weeks Claude is quoting her roast-curve page when buyers ask about inventory software for roasters.

The check is simple. Open your site's robots.txt (just add /robots.txt to your domain). Look for any line that mentions GPTBot, ClaudeBot, PerplexityBot, CCBot, or "User-agent: *" followed by "Disallow: /". If you see any of those blocking AI crawlers, fix it. If you want help diagnosing what is keeping you out, the signs your business is invisible to AI search post walks through the common patterns.
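If you would rather not read the file by eye, the same check can be sketched in a few lines of Python using the standard library's robots.txt parser. This is a minimal sketch, not a complete audit: the crawler list is just the four tokens named above, the sample file is a made-up version of the "every bot except Googlebot" mistake, and example.com is a placeholder domain.

```python
from urllib.robotparser import RobotFileParser

# The four tokens named in this post. Extend the tuple with any other
# agents the vendors document (this list is an assumption, not exhaustive).
AI_CRAWLERS = ("GPTBot", "ClaudeBot", "PerplexityBot", "CCBot")


def blocked_crawlers(robots_txt, url="https://example.com/", agents=AI_CRAWLERS):
    """Return the agents that this robots.txt text forbids from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# Hypothetical robots.txt that allows Googlebot and shuts out everyone else.
SAMPLE = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

print(blocked_crawlers(SAMPLE))
# ['GPTBot', 'ClaudeBot', 'PerplexityBot', 'CCBot']
```

To test a live site instead of pasted text, the same parser can load the file over the network with `set_url()` followed by `read()`. Treat the result as a first pass: the standard-library parser's matching rules are simpler than what some crawlers implement.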

What "controllable" actually means in practice

You cannot edit what ChatGPT learned about you in 2023. You can absolutely shape what it quotes about you today. The live-retrieval path is influenced by five things that are all within your reach.

First, make sure AI crawlers can read your site. That is the robots.txt fix above.

Second, write pages that answer a specific question in the first one or two sentences, then go deep. Engines pull the opening sentences as the quote, so the opening has to stand alone. If a buyer typed your H2 into ChatGPT, would the next sentence answer them? If not, rewrite it. The post on what makes a page quotable breaks this down.
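One rough way to run that test across a whole page is to pull out the first two sentences under every heading and read them on their own. The sketch below assumes your draft is in Markdown with "## " headings and uses a deliberately naive sentence splitter; the function name and sample text are made up for illustration.

```python
import re


def opening_sentences(markdown_text, count=2):
    """Map each '## ' heading to the first `count` sentences that follow it."""
    sections = []  # list of (heading, [body lines]) in document order
    for line in markdown_text.splitlines():
        if line.startswith("## "):
            sections.append((line[3:].strip(), []))
        elif sections and line.strip():
            sections[-1][1].append(line.strip())

    openings = {}
    for heading, body in sections:
        text = " ".join(body)
        # Naive split: a sentence ends at ., ! or ? followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", text) if text else []
        openings[heading] = " ".join(sentences[:count])
    return openings


# Hypothetical draft page used only to show the output shape.
PAGE = """\
## Does ChatGPT crawl my website?
OpenAI runs crawlers that read public pages. Allow them in robots.txt to be quoted. More detail follows here.

## Empty section
"""

for heading, opening in opening_sentences(PAGE).items():
    print(f"{heading}\n  -> {opening or '(no opening text)'}")
```

If the printed opening does not answer the heading by itself, that section is a candidate for a rewrite.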

Third, attribute every page to a real human author. Add Person schema (structured data in the page's markup that tells engines the page was written by a real person, not just a brand). Link the byline to a real author bio page with credentials, photo, and contact info. Engines weigh author signals heavily, especially Claude.
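As an illustration, author markup of this kind is commonly published as a JSON-LD script tag using schema.org's Article and Person types. The sketch below builds one in Python; the helper name is invented for this example and the author URL is a placeholder, so swap in your own bio page.

```python
import json


def person_article_jsonld(headline, author_name, author_url, job_title=None):
    """Build a JSON-LD <script> tag that attributes an article to a person."""
    author = {"@type": "Person", "name": author_name, "url": author_url}
    if job_title:
        author["jobTitle"] = job_title
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": author,
    }
    return (
        '<script type="application/ld+json">\n'
        + json.dumps(data, indent=2)
        + "\n</script>"
    )


# The author URL below is a placeholder, not a real bio page.
print(
    person_article_jsonld(
        headline="Does ChatGPT Use My Website? How AI Engines Find and Cite Sites",
        author_name="Malik Browne",
        author_url="https://example.com/authors/malik-browne",
    )
)
```

The output goes in the page's HTML, and the `url` should point at the same bio page the visible byline links to, so the markup and the page agree.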

Fourth, cluster your content around one topic instead of writing one-off posts about everything. This is the core of the Citation Cluster Method. A site that has 30 connected pages about one specific thing gets quoted more often than a site that has 30 pages about 30 different things. BakingSubs is proof. The site is not large by general-web standards. It is dense within one niche, which is what made 162,500 Copilot citations possible.

Fifth, do not chase scale. The biggest site does not win AI citations. The most topically focused one does, which is why AI often recommends the smaller competitor.

What the four-engine difference means for your roadmap

If you are starting from zero, do not try to optimize for all four engines at once. The path that gives you the fastest signal is fixing crawler access, then writing one tightly focused cluster of pages that answers the specific questions your buyers ask. Copilot will usually be the first to start citing you. Perplexity often follows. Claude takes longer because it weights author signals more. ChatGPT's browsed citations come in once your pages are clearly the best match for a current query.

Here is the contrarian piece most generic GEO advice gets wrong: scrambling to "get into training data" is a waste of time. The next training cutoff is months or years away, you have no way to verify inclusion, and you cannot make the model name you as a source from training data anyway. Spend your effort where the citation actually shows up with a link, which is live retrieval. The mechanism is explained in more detail in what replaces SEO when buyers stop Googling.

Frequently asked questions

Does ChatGPT crawl my website?

ChatGPT itself does not crawl the web continuously the way Google does. OpenAI runs separate crawlers: GPTBot collects pages for training, OAI-SearchBot indexes pages for ChatGPT's search results, and ChatGPT-User fetches pages live when a user's question triggers browsing. When that happens, the engine may quote your page, provided your robots.txt allows those agents to read your site.
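To see whether AI crawlers are actually reading your pages, you can count their user-agent tokens in your server's access log. The sketch below assumes a log where each line includes the user-agent string; the token list and the sample lines are hypothetical and simplified. User-agent strings can be spoofed, so treat the counts as a hint rather than proof.

```python
from collections import Counter

# Tokens to look for. This list is an assumption; add any other agents
# the vendors document.
AI_AGENT_TOKENS = ("GPTBot", "ClaudeBot", "PerplexityBot", "CCBot")


def count_ai_hits(log_lines, tokens=AI_AGENT_TOKENS):
    """Count log lines whose text mentions each crawler token (case-insensitive)."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in tokens:
            if token.lower() in lowered:
                hits[token] += 1
    return hits


# Made-up, simplified access-log lines for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2025:10:00:00 +0000] "GET /roast-curves HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '203.0.113.9 - - [01/Jan/2025:10:02:00 +0000] "GET /pricing HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [01/Jan/2025:10:03:00 +0000] "GET / HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
    '203.0.113.5 - - [01/Jan/2025:10:05:00 +0000] "GET /green-beans HTTP/1.1" 200 3072 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
]

for agent, count in count_ai_hits(SAMPLE_LOG).most_common():
    print(f"{agent}: {count}")
```

On a real server you would pass an open log file instead of the sample list, since a file object yields one line at a time.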

How do I know if ChatGPT or Claude has my site in its training data?

You cannot verify this reliably. The models do not expose a lookup, and asking ChatGPT "do you know my site" is not a real test, because the model will often hallucinate familiarity. The practical answer is to stop worrying about training data and focus on whether engines cite you in browsed answers today. The 60-second test shows you how to check that directly.

Should I block GPTBot to protect my content?

Only if you understand what you are giving up. GPTBot is OpenAI's training crawler, so blocking it alone keeps your pages out of future training runs. The bigger risk is a broad rule that also blocks OAI-SearchBot and ChatGPT-User, which stops ChatGPT from citing you in browsed answers and removes a growing source of buyer traffic. Most businesses lose more than they protect. If you sell expertise, the citation is the goal, not the threat.

Why does Copilot cite some small sites so much more than big ones?

Copilot's live retrieval rewards topical density and clear question matching, not raw site size. A site with 40 deep pages on one specific topic often outranks a 4,000-page generic site, because Copilot is trying to quote the page that best answers a narrow question. BakingSubs sits in this category, which is how it accumulated 162,500 Copilot citations with 112,500 in just the last three months.

Is AI visibility just SEO with new terminology?

No, and treating it that way will cost you results. The overlap is real (crawlable pages, clear structure, good content), but the mechanics, the ranking signals, and the output format are different enough that a Google-only strategy leaves most AI citations on the table. The side-by-side post walks through where the two diverge.

The next step

If you read this far, the practical move is to find out where you stand on the live-retrieval path right now. Pull up your robots.txt and confirm you are not blocking AI crawlers. Then check whether any of the four engines actually quote you today. The free AI Visibility Check runs eight buyer-style questions across the engines and tells you which path is failing and what to fix first. Once you can see the gap, the rest is just deciding which cluster to write.