REPORT·AI Search

Your AI Visibility Metrics Are Lying to You: How Conversational Context Shapes AI Responses

AI names the same brands whichever path a buyer takes, but frames them differently, steering them to different vendors. New research on why prompt tracking misses the crucial detail of AI responses.

22 June 2026Tom Rudnai

Jump to section

TL;DR

We asked an AI the same business question many times, changing only one earlier turn in the conversation: the "lens" the buyer came in through.
The brands it mentioned stayed consistent, particularly the leaders. How it framed those brands (the strengths, weaknesses and ultimate recommendation) changed significantly.
Visibility trackers would show a win for every brand mentioned - meanwhile your buyer is being pushed very clearly in one direction.
The effect is seen across all categories tested, but is strongest in more mature and consolidated categories.
Implication: existing visibility tracking is extremely flawed. It treats prompts as an isolated event, where in reality the user’s specific conversational context has a huge bearing on what actually counts: the recommendation.

Introduction

The assumption we set out to test

This study set out to answer a question that arose off the back of our original Dark AI research. This research proved the flaws of citation tracking as an AI search metric, particularly for non-transactional categories like B2B. Just 16% of AI responses in B2B journeys cite a brand directly. As we know the majority of a B2B buyer journey lies in the awareness and consideration stages - framing the problem, building and achieving consensus on requirements - and at those stages we observed a 0% citation rate.

Since then we’ve argued that AEO strategies must be multifaceted. Yes, you want to be “visible” in the 16% of prompts where a citation is possible. What we call the ‘tip of the iceberg’ in terms of AI’s influence on your buyers’ journey - the minority that is visible above the surface of existing metrics.

But you also want to exert category-level influence beneath the surface - what we term “Dark AI” - in order to shape the category as a whole in your favour. The logical next question was to understand the extent to which those two phases - above and below the surface - are interlinked. Specifically, to what extent does the way AI conceptualises and frames a problem impact its eventual recommendation?

We also wanted to answer a nagging question we had in relation to prompt tracking as a means of AI search measurement. Unlike a keyword, a prompt is not an isolated event. It is a tiny window into a bigger conversation. It carries the context of that preceding conversation. Yet we study that tiny window in isolation, ignoring all the other elements that influence our buyers’ actual experience.

It’s like finding a needle, and concluding the haystack it resides in is actually a pile of needles.

A row of seven message boxes representing a conversation; only the final box is highlighted and labelled "what tracking measures," while a bracket spanning all seven is labelled "the conversation that actually shaped it."

Again, this is particularly relevant for complex B2B categories, since the more complex the problem, the more likely it is that any “conversion prompt” you track will in fact follow a long period of sense-making, both within and across sessions.

How we tested it

You’ll find the full methodology and prompt-set at the bottom of this piece, but in summary:

We ran a series of five prompts consecutively to simulate a straightforward conversation from problem → solution.
Prompts 1, 3, 4 and 5 were held constant.
Prompt 2 (”the lens”) was adapted to apply different personas / requirements, creating multiple conversational pathways.
We ran each path multiple times, comparing the responses to one another and to a more generic “control” pathway. We ran this across multiple different categories, using ChatGPT.

We sampled 8 different categories, centring on either data, technology or strategic problems so that we could get a sense of how the effect varied in different problem-spaces.

Note: we are publishing our initial findings, given their importance. We now intend to extend this across other models. Historically, we’ve seen subtle behavioural variations between models but never fundamentally different behaviour.

Defining visibility vs framing

Before getting into the findings, it’s useful to differentiate two terms which we will use deliberately throughout.

Visibility is simply whether an AI model talks about you. This is what every AEO tool measures today, counting the number of times a brand’s name is mentioned or cited (although at Demand-Genius we do extend this to other entities such as products, events, thought leaders or trademarked terminology).

Framing is how the model talks about you. How does it present your strengths, weaknesses and trade-offs? Are you the obvious fit or the hedged afterthought? Is the category being defined around the criteria you win on, or the ones a competitor wins on?

Visibility is whether you're in the room. Framing is what's being said about you once you are.

Two AI responses can - and as we will show, do - name exactly the same brands and tell completely different stories about them.

Across our study, visibility remains relatively constant whichever path the conversation took. The same brands get mentioned. But framing moved a great deal.

Which means both brands’ prompt tracking will chalk that response up as a win, while buyers are funnelled very clearly to one over the other. That is why the study is so important to understand, because it means that your AI search measurement may very well be lying to you.

How do you measure Framing? Frame retention ratio (FRR). Every lens we introduced early in a conversation puts a set of ideas on the table - particular concepts, priorities and language. Our main measure, the frame retention ratio, asks how many of those ideas are still load-bearing in the model's final answer. If an early lens introduced ten framing concepts and four survive into the conclusion, the frame retention ratio is roughly 0.4. High retention means the lens stuck - the model carried that early framing all the way through to its recommendation. Low retention means it let go along the way. This allowed us to measure whether the frame from prompt 2 was carried forward into the subsequent conversation.

Headline takeaways

The brands stay the same

To measure brand recurrence - whether the shortlisted brands change - we use a metric from our Dark AI research, assigning scores to capture how consistently the leading brands recur. K1 for the single top brand, K3 for the top three, K5 for the top five, so a high K-score means the leaders are stable and a low K means they are volatile. Here we measure it against the control: how far does changing the lens shift which brands appear, and does that shift reach the leaders or only the long tail?

Across the sample, the lens has little effect on visibility. There is some movement, but it sits almost entirely in the long tail of mentioned brands, not the leaders. At K3 most categories barely move: the aggregate is 0.82, and even that is pulled down by two outliers we return to later. Set those aside and the leading brands recur almost every time, whichever lens the buyer arrived through.

This is important. As far as a prompt-tracking dashboard is concerned, every one of these pathways is a win for the brands involved; the category leaders are all visible. The same names keep surfacing, the dashboard lights up green, and nothing in the data hints at the underlying problem.

The story doesn't

What changes is the framing. Frame retention averages 0.37 across paths: a material share of the concepts the model introduces back in prompt 2 survive the whole conversation and are still load-bearing by the end. That is more than enough to move the recommendation. Pushed to commit at prompt 5, those retained concepts pull the answer in different directions. The same brands on the shortlist, but framed differently to lead the buyer to a very different impression.

To make that concrete, it's worth walking through what this looks like inside a single category most readers are likely to be familiar with, before we step back to the overall picture and what it means.

What this looks like in practice: marketing technology

A five-row flow diagram tracing the CRM walk-through. Prompt 1 is a shared opening question that branches at Prompt 2 into five lenses (Attribution, Analytics, Demand gen, Lifecycle, ABM) plus a Control. Prompt 3 shows each lens's emphasis, Prompt 4 shows frame retention scores (0.22, 0.26, 0.36, 0.31, 0.50), and Prompt 5 shows the final recommendation—HubSpot for Attribution and ABM, Salesforce for the rest. Caption: the same brands surface across paths, but the framing, analysis and recommendation change.

Prompt 1: Common starting point

"We're reviewing our marketing stack because campaigns, reporting and customer engagement feel disconnected. How do companies usually modernise marketing operations?"

Every pathway opens with the same question. We're not testing different conversations; we're testing how the route through one conversation changes where it lands.

Prompt 2: Applying the specific lens

Here we change the lens: the pain point through which the model is asked to read the problem. Five lenses, plus a deliberately generic question as the control.

Attribution: "We struggle to understand which activities drive pipeline and revenue. How do companies solve attribution?"
Analytics: "Marketing reporting is inconsistent and difficult to trust. How do companies improve this?"
Demand generation: "Generating pipeline consistently is difficult. How do companies improve demand generation?"
Lifecycle: "Customer journeys and lifecycle programmes feel fragmented. How do companies improve lifecycle marketing?"
ABM: "We want to target accounts more effectively rather than running broad campaigns. How do companies usually approach this?"
Control: "What marketing challenges do mid-market B2B companies usually face?"

Prompt 3: Attempted reset

"If this worked well, what outcomes or capabilities would companies expect?"

This prompt was held identical across every path, and deliberately free of any request for a recommendation. It simulates the requirement-building phase of a real buying journey, and it creates distance between the lens and the eventual ask. That distance helps us measure how the early frame shows up even after we have tried to push the conversation back down a more neutral path.

Prompt 4: Same brands are visible, but completely different winners

"What marketing platform approach would you recommend for a mid-market B2B company and which vendors should be considered?"

Now we ask for a shortlist. On the surface, the paths agree. Brand recurrence is 91%, meaning the top 3 brands in the control pathway almost always recur regardless of the lens. This means each of those brands' prompt-tracking dashboards would log a win across all pathways - by the only measure most AEO tools capture, these are identical conversations.

Underneath, they are not. Frame retention ratio averages 0.33 across the five paths, ranging from 0.22 to 0.50. Around a third of the original framing has survived into the recommendation. It is materially influencing the solution-exploration conversation.

Prompt 5: Crystallising the impact

"Given everything in this conversation, what solution, operating approach, and platforms would you recommend now?"

This wasn’t originally part of the study, but we re-ran it with this included. The purpose was to crystallise the impact frame retention ratio has on the impression the buyer will receive. We wanted to see if the effect held true when we got out of complex metrics like frame retention ratio and just asked, outright, what solution the conversation points to as the best fit.

Frame retention rises from 0.33 to 0.49, even as we get further from the diverging prompt. (Although it’s worth mentioning that this prompt openly invites the model to draw on "everything in this conversation", so treat this as a deliberately amplified view of the trend observed in prompt 4). Crucially, the recommendation splits. We saw roughly 60% convergence on a single brand: three paths land on Salesforce, the other two on Hubspot.

Again, both brands appear in every answer, so both prompt-tracking dashboards will still be logging 100% of these responses as a win even though the model has just pointed two buyers towards different vendors.

This isn’t a behavioural study and we don’t have data on exactly how this influences buyers. But it's reasonable to expect this at best entrenches a front-runner in the buyer's mind, and at worst leads them to speak first (or only) to the preferred vendor.

Variation across the 8 categories

In short, the effect holds true and is meaningful in all eight categories we tested, though it does vary in its significance and there are a few outlier categories worth exploring in more detail.

Frame retention ratio varies significantly between different pathways, although when aggregated across each category the view is relatively stable, ranging between 0.27 and 0.46. This makes sense; different lenses will deviate further from the control. Across all categories, though, the commonality is that pathways deviate enough to create a meaningful shift in the framing of subsequent responses. The impact of that volatility is seen clearly in the way most categories' lenses split on the final recommendation.

Brand recurrence is more volatile at the extremes (range: 0.42 - 1.00) but relatively consistent when you eliminate obvious outliers. 6 of 8 categories show brand recurrence of 0.87 or higher, meaning that the top 3 brands in the control pathway recur across alternative pathways. The shortlist doesn’t change, only the way it is framed and which name is positioned best within it.

Outliers to this trend are the more abstract categories that are less consolidated around a specific set of solutions. AI strategy and Publisher monetisation both show much more volatility. There’s an important lesson in this: there is no one-size-fits-all AEO playbook. Your strategy must adapt to the category in which you operate. There is meaningful instability in which brands are visible (and therefore an opportunity for brands to claim it) in more open and strategic conversations that don’t resolve clearly to an established, mature category of solutions.

Why this shouldn't surprise anyone

These findings are really important for brands prioritising AI visibility, relying on prompt tracking and counting citations and mentions to measure their brand’s visibility and their strategy’s impact. This study should push brands to adopt a more nuanced approach to measurement. They are not altogether surprising, though, when you think about how AI models operate and based on some of our previous research.

Prompts are not isolated events

One of our personal frustrations with the prompt tracking industry is that it ignores important differences between a keyword and a prompt. A keyword forces the user to consolidate their intent into a short string that exists in isolation. A prompt is long, complex, verbose, carries criteria and is interwoven to the conversational context in which it sits.

When you type a question into an AI assistant, the model doesn't see only that question. Generally it receives the user prompt alongside a system prompt which is the same for all users and conversations, and context which is comprised of various different elements.

A diagram showing the three inputs an LLM receives before producing an answer: the System Prompt (set by the vendor), Context (containing configuration, memory, conversation history, metadata, and documents fetched via retrieval/grounding such as web search and knowledge bases), and the User prompt (what you type). All three feed into the LLM, which produces the Answer—illustrating that the user prompt is only one of several inputs shaping the response.

So two buyers who enter the same user prompt are not, from the model's point of view, asking the same thing. Each question is read through everything that came before it, including instructions neither buyer ever saw (e.g. memory). For anyone trying to measure AI visibility, that's the first problem: every user sees something different. If you don't control the context and the conversation around it, you aren't measuring what your buyer actually experiences.

We wrote an explainer on system prompts, context and why they matter for AI visibility tracking if you want to learn more.

Earlier turns condition later ones

The deeper answer lies in how an LLM is built. AI systems are autoregressive: they produce each answer by predicting what comes next given everything before it. Think of them like the most advanced predictive text you’ve ever seen.

Expecting a context-free answer from a system designed to build on its context is obviously flawed. It's like asking a colleague where to eat right after ten minutes of telling them how much you dislike spicy food: they won't suggest the Sichuan place, even if in a vacuum it's their favourite.

Why visibility and framing differ

That leaves one question: if everything is conditioned on the path, why does the shortlist stay stable while the framing moves? Well, it’s impossible (or at least we don’t know how) to say for sure, but our research hints at some hypotheses we will investigate further.

AI is risk averse for decisional queries. Our Dark AI study showed a fascinating pattern of convergence as the user’s query moves from problem-oriented to decision-oriented. Brand count goes from 0.67 at awareness to 6.33 at conversion, but variability falls to near-zero. In other words, the model becomes very conservative in the brands it recommends, honing in on only the most credible options. This explains why visibility is relatively unaffected - there are only so many highly credible options in the CRM category. The viable shortlist doesn’t change, but the substance of the answer, how it frames that shortlist and the ultimate recommendation does.
Framing, however, is free from those constraints. We know from the same Dark AI study that language shifts from exploratory to directive as prompts progress down-funnel. It becomes more decision-oriented and opinionated in a pattern we call intent matching. AI is smart: as the user looks to move towards a decision, the AI recognises and adapts to that intent. So while the list of options remains stable the AI helps move the user towards a decision by expressing its opinion in the surrounding frame and analysis of those options.

Why this hits B2B hardest

If this study weakens the case for prompt tracking, it weakens it most in complex categories like B2B and high-consideration consumer purchases. The harder the decision, the longer the conversation that precedes the prompt you're tracking. Awareness and consideration are far more prominent in purchasing a new Billing solution than a tube of toothpaste. And it's that preceding conversation - all the messy problem framing and requirement building - that this data shows has a major bearing on the ultimate recommendation.

B2B buying is mostly a framing exercise

There’s an old truism in the world of B2B sales that if you are sent a request for proposal (”RFP”), you’ve already lost the deal. Because your competitor was in the meetings helping the customer understand the market, their own requirements, and ultimately write an RFP that reflects their own worldview - their desired “framing”.

A B2B purchase rarely starts with "which product should I buy?". It starts with "what's actually wrong, and how are other people solving it?". It’s the bulk of the journey, and it's increasingly done with AI. This study shows the extent to which that process carries through into the eventual recommendation. If your brand - your story, research and thought leadership - isn’t influencing requirements in your favour, someone else’s is.

That gives a simple rule of thumb. The more framing a category demands, the more divergent the paths a buyer can be led down before they ever ask for a shortlist. A commodity purchase with one obvious answer is hard to steer. A complex, multi-stakeholder decision with real trade-offs (most of B2B) is full of forks, and each one leaves a fingerprint.

Most tooling is built to reflect the “context” of the average consumer

We have written about this in more detail in a short explainer (What AI Actually Sees: System Prompts, Context, and AI Visibility Tracking), but the majority of AI prompt tracking tools (even and especially the more sophisticated ones) are built to simulate the average consumer experience so that their data is broadly applicable and prioritises their primary ICP.

If you’re Colgate or Nike, it’s a good thing - you want your tracking to reflect the most common consumer experience. If you’re a specialist B2B brand, it’s a big problem, because your entire value proposition is built around being a specialist tool for a specific type of buyer. If your tooling doesn’t reflect the experience of your specific ICP, and this study shows the impact that has on the outputs, it means your current prompt tracking insights are completely unreliable.

What it means (implications)

The findings in this study are extremely important for AI search strategy, as it undermines what has become the widespread standard of measurement. Prompt tracking, by nature, assumes that each prompt is an isolated event that can be replicated to get an accurate representation of what users see. In reality, every user prompt is intricately connected to its context and what you see in your dashboard may completely misrepresent how AI is shaping the market for or against you.

For AI tracking

Three things we’d recommend ensuring your AI tracking enables, whether it's a third party tool like Demand Genius or internal, would be:

Measure your buyers’ experience. The challenge with most AEO tracking tools is that they either use the API, which strips out all context, or they run virtual machines to replicate a user’s experience. The latter works for consumer brands that want to measure the average consumer experience but not for B2B brands that want to replicate the context of their specific buyers. You need a B2B-specific tool that lets you control the type of user whose experience you are simulating, ideally across different segments.
Cluster prompts by buyer intent. If we cannot accurately represent our buyers' experience in a dashboard, we must adapt measurement to recognise that anything we see is indicative only. Cluster tracked prompts around specific intents, jobs-to-be-done or category entry points to get a sense of how you show up in the “moments” that define your success.
Measure how you appear as well as if you appear. Recognise visibility as just one goal of AEO, alongside category-level influence (shaping problem framing and requirements through thought leadership and research) and optimising your own positioning in AI responses. Does the model present you in the right way to the right buyers? Sentiment analysis is a good start but it's important to remember that sentiment in LLMs is inherently positive and will vary for different buyers depending on their requirements and the conversational context that has been set.

For AI strategy

Visibility is the wrong headline goal. The model can say "Brand X isn't a good fit" and your dashboard still lights up green. Yes, you want to be visible for critical decision-oriented prompts, but it’s also important to shape how you appear. Does the AI present you in the way you’d train one of your sales reps to?

Cultivating AI perception requires a very different strategy to the one most of the rebranded SEO experts promote. Instead of creating content designed to rank 1:1 against specific prompts, it's about cultivating your digital footprint in its entirety to communicate who you are, who you’re for and how you win. In fact, scaling content aggressively to create content for every prompt variation can be very counterproductive, encouraging brands to scale content with AI and in doing so, reduce clarity, consistency and differentiation. We’d recommend checking out Lily Ray’s study, It works until it doesn’t, for some real world examples of the damage this can do to your brand and performance.

One challenge is that cultivating your overall digital footprint bridges more disciplines. Yes, an SEO-style optimisation skillset has value but it also touches product marketing, content, brand and PR. This is one reason why we believe outside support is helpful, simply because it requires a broad and generalist skillset that can be difficult to hire for and train in a single individual.

For brand & positioning

Any positioning or repositioning exercise now has a dimension most teams ignore: the model's view of you. 89% of B2B buyers use AI in the buying process, which means for many potential customers their first and sometimes only touchpoint with your brand will be with AI.

A two-panel comparison. The left panel, "What you see," shows a bold, colourful book-cover graphic labelled "What you ship: a fresh cover." The right panel, "What AI sees," shows a long document where the new cover is just a small highlighted strip at the top and the rest is plain text, labelled "Everything AI reads"—illustrating that a rebrand's "shop window" is only a fraction of what AI actually reads.

When you pivot, rebrand or sharpen your positioning to find or hold product-market fit, you need to have a plan to communicate that pivot to AI and ensure that the “new you” is not presented as a footnote.

We wrote a full piece on rebranding in the AI era here, but the big challenge is that you cannot simply change the “shop window” (Home page, product pages, a few pillar pieces of content). That shop window may be +90% of what a human sees, but to AI it looks like exactly what you want to avoid: a footnote.

For sales

Buyers arrive to sales conversations with preconceptions, and those preconceptions are largely formed through conversations with AI. Anecdotally, we speak with sales teams all the time who say that prospects specifically attribute objections to AI. “Gemini said you’re slow to integrate” or “Claude said you’re overpriced”.

Sales teams need to understand the preconceptions that AI is feeding to your buyers, so that they can be prepared to pre-empt and ultimately overcome those objections. Sales leaders also need to understand where AI misconceptions can become deal risks. Some decision makers might never interact with your sales team directly. Think of the CFO about to sign off on a 6-figure deal near the end of your quarter, who decided first to ask ChatGPT whether your product is good value for money. What will it tell them, and do you think that has the potential to block or at least delay your deal?

Methodology

How we built this

Design

Each test holds the conversation's overall question constant and changes exactly one earlier turn: the lens. A shared opening question (prompt 1) sets the topic. The lens (prompt 2) introduces one pain point through which to read the problem, with five lenses per category plus a deliberately generic control. A shared, recommendation-free turn (prompt 3) simulates the requirements-building phase and puts distance between the lens and the eventual ask. The original four-prompt design ended at a shared recommendation request; an evolved design added a fifth prompt asking the model to commit to a final recommendation. Each path was run three times to gauge run-to-run noise, across eight B2B categories. The model was OpenAI's gpt-5.3-chat.

How we measured framing

We used three retention measures: frame retention ratio (the share of the lens's prompt-2 concepts still present in the answer), frame share of answer (how much of the answer is built from those concepts) and frame lock score (the raw count). The core test is comparative: are answers that travelled the same lens more similar to each other than answers from different lenses? We compared within-path against between-path pairs, with permutation and bootstrap checks for whether the gap is more than noise. For visibility, we assess how many of the brands present in the control response recur in the different pathways. K1 measures whether the top brand reappears, K3 to first three and K5 the first five.

Limitations

This study does, of course, have its limitations and we think it’s important to state them clearly. We intend to investigate further, and encourage others to do so as well.

This is a single model study, using Open AI’s GPT-5.3.
The sample size for different pathways was 3. In our experience, this is plenty to derive clear signals, but of course more runs would reinforce the findings.
8 categories is enough to provide a clear directional signal and statistical significance for the overall trends, but does not definitively support specific learnings on category-level variation.
Prompt 5 is deliberately leading, beginning with “based on this conversation”. This is because the purpose is to crystallise and illustrate the findings which otherwise rely on complex metrics like brand recurrence and frame retention ratio. The primary purpose of this study is to understand the impact on framing, not any one category’s winner.

What’s Next

The most obvious next step we intend to conduct is replication across models and providers. Everything here is a single model; the obvious question is whether the path fingerprint is a quirk of one system or a general property of how these models work. We expect the latter - it is a logical consequence of how all autoregressive models are built - but we’d like to understand any differences that exist.

We'll also widen the lens: more categories, to map where the effect is most pronounced, and a tighter framing taxonomy, so the coarse top-line labels stop masking movement we can already see in the underlying language.

If you're measuring AI visibility, running AEO for a complex category, or just sceptical of what we've found, we'd like to hear from you - including, especially, where your data disagrees with ours. The fastest way to pressure-test a finding is to put it in front of the people it's about. Consider this an open invitation to reproduce, critique or build upon our findings!