When a vendor reports that some change tripled their AI citations, the right first question isn't "how?" but "compared to what, and how do they know it was the change?" A useful worked example landed in June 2026: a 19-week dataset, published by the brand-visibility platform waikay.io about its own site, reporting weekly Bing AI citations rising from an average of 167 to 614 (about a 3.7× lift) after it added entity markup. It's a genuinely interesting result. It's also a textbook case for reading AI-visibility studies critically.
This is part of the measuring AI visibility series, and a companion to what Bing Webmaster Tools' AI data can tell you — the free, first-party source this study is built on. The goal here isn't to judge one vendor; it's a reusable lens for the flood of "we did X and citations jumped" posts.
What did the case study report?
A large headline number, and a more complicated story underneath it. Over 19 weeks of Bing Webmaster Tools data — 11 weeks of baseline, the markup submitted in week 12, then 7 weeks after — the site reported the following, all attributed to waikay.io's own June-2026 write-up:
| Metric | Reported result |
|---|---|
| Weekly AI citations (avg) | 167 → 614 (~3.7×) |
| Citations per cited page (avg) | 3.65 → 8.55 |
| Peak weekly citations | 1,063 (week 19) |
| English-language raw volume | down ~8% |
| English citations by funnel stage | TOFU −80%, MOFU +44%, BOFU +406% |
| Share of total growth from non-English pages | ~75% (French + Spanish) |
The authors are commendably candid about the confounds: they note the non-English surge "likely" came from a deeper site re-crawl rather than the markup, and they list the limits themselves — a single domain, a short 7-week window, no controlled experiment. That candour is exactly what makes it a good teaching example.
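The headline multipliers can be re-derived from the table in a few lines. The sketch below uses only the reported averages. The implied page counts are an assumption: they hold only if the per-page average was computed as weekly citations divided by cited pages, which the figures quoted here do not confirm.

```python
# Reported averages from the case study's summary table.
baseline_weekly, post_weekly = 167, 614
baseline_per_page, post_per_page = 3.65, 8.55


def lift(before: float, after: float) -> float:
    """Return the after/before multiplier."""
    return after / before


weekly_lift = lift(baseline_weekly, post_weekly)
per_page_lift = lift(baseline_per_page, post_per_page)

# ASSUMPTION: per-page average = weekly citations / number of cited pages.
implied_pages_before = baseline_weekly / baseline_per_page
implied_pages_after = post_weekly / post_per_page

print(f"weekly citations:         {weekly_lift:.2f}x")
print(f"citations per cited page: {per_page_lift:.2f}x")
print(f"implied cited pages:      {implied_pages_before:.0f} -> {implied_pages_after:.0f}")
```

Under that assumption, the 3.7× splits into roughly 2.3× more citations per cited page and about 1.6× more cited pages, which is already a more informative picture than the single multiplier.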
The four questions to ask of any AI-visibility case study
Run every "we did X, citations rose" claim through these four. The waikay study passes some and openly fails others — which is the point.
- Is there a control, or just a before-and-after? A baseline isn't a control. If anything else changed in the window — a re-crawl, a freshness pass, a new backlink, an algorithm update — the before-and-after can't isolate the cause. Here the authors flag that a re-crawl plausibly drove ~75% of the lift, so the headline can't be cleanly credited to the markup.
- Who ran it, and do they sell what they measured? Independence matters. This is a vendor measuring the effect of its own product on its own site — self-reported, single-domain. That doesn't make it false; it makes it a hypothesis from an interested party, which is weaker evidence than an independent or corroborated result.
- Did the valuable citations move, or just the raw count? Look at composition, not totals. English raw volume actually fell ~8%, while citations shifted down-funnel (commercial, bottom-of-funnel pages up sharply). Whether that's good depends on your goal — and it's invisible if you only read the 3.7× headline.
- Has it been replicated? One site for seven weeks is an anecdote, not a pattern. A finding earns weight when it reproduces across domains and time. Until then, treat it as a prompt to run your own test.
A 3.7× lift with no control, on one self-reported domain, where the vendor sells the thing being measured and a re-crawl could explain most of it — that's a hypothesis, not a proof. Read the limitations section as carefully as the headline.
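To see how much the re-crawl caveat matters, here is a back-of-envelope scenario. It assumes, purely for illustration, that the entire ~75% non-English share of growth was re-crawl-driven and that none of the remainder was. The study establishes neither, so read the result as one rough scenario, not an estimate of the markup's true effect.

```python
baseline_weekly, post_weekly = 167, 614  # reported weekly averages
non_english_share = 0.75                 # reported share of total growth (approximate)

growth = post_weekly - baseline_weekly

# ASSUMPTION: all non-English growth came from the re-crawl, none from the markup.
growth_left_for_markup = growth * (1 - non_english_share)
adjusted_lift = (baseline_weekly + growth_left_for_markup) / baseline_weekly

print(f"headline lift: {post_weekly / baseline_weekly:.2f}x")
print(f"lift if re-crawl growth is excluded: {adjusted_lift:.2f}x")
```

On those assumptions the 3.7× shrinks to roughly 1.7×. The exact figure matters less than the size of the gap between the two readings of the same data.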
What's the durable takeaway, separate from the headline number?
Two things survive the scrutiny. First, Bing Webmaster Tools is a real, free, first-party place to watch your own AI citations move, which is why a study could be built on it at all, and why you can run the same before-and-after on your own site instead of trusting anyone's. Second, citation totals and citation composition are different metrics. A site can deepen, earning more citations per cited page as this one did (3.65 → 8.55), even where raw volume dips, as it did for the English pages here (down ~8%). That echoes the broader pattern that focused, well-covered pages tend to be cited more thoroughly. That's a citation-coverage story, and it's more useful than any single multiplier.
What doesn't survive is the causal leap. "We added markup and citations rose 3.7×" quietly becomes "the markup lifted citations 3.7×", and the authors' own re-crawl caveat shows why that step is unearned. The same discipline applies to the entity-strength claims that fill this space: entity signals plausibly help, but proving it needs a control, not a coincidence.
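One common way to get closer to a control is a difference-in-differences comparison: change one group of pages, leave a comparable group alone, and subtract the untouched group's change from the changed group's. The sketch below is a minimal illustration, not the study's method. The function name and every number in it are hypothetical placeholders.

```python
from statistics import mean


def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    return treated_change - control_change


# HYPOTHETICAL weekly citation counts, invented for illustration only.
treated_before = [100, 110, 90]    # pages that later received the change
treated_after = [200, 210, 190]
control_before = [100, 95, 105]    # comparable pages left untouched
control_after = [160, 150, 170]

effect = diff_in_diff(treated_before, treated_after, control_before, control_after)
print(f"raw before/after change on treated pages: {mean(treated_after) - mean(treated_before)}")
print(f"change net of the control group:          {effect}")
```

In this made-up example the treated pages gain 100 citations a week, but the untouched pages gain 60 over the same window, so only 40 can plausibly be credited to the change. The comparison is only as good as the match between the two groups.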
How to test a claim like this on your own site
Don't take the case study's word; reproduce the method honestly:
- Establish a clear baseline in Bing Webmaster Tools and your server logs before you change anything.
- Change one thing at a time. If you ship markup and trigger a re-crawl and add content in the same week, you've rebuilt the confound you were trying to avoid.
- Watch composition, not just the total — track citations per page and which funnel stages move, the way the reporting stack frames it, so a flat or falling raw count doesn't hide a real shift.
- Give it time, and expect noise. Short windows make it easy to over-read random spikes; citation activity is volatile and decays without freshness.
- Cross-check across engines. Bing is one surface; a lift there may or may not show in Google AI Mode, Perplexity, or ChatGPT.
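For the baseline and noise steps above, a simple permutation test gives a rough sense of whether a before/after gap is larger than shuffling the weeks would produce by chance. This is a sketch under stated assumptions: the weekly counts are invented placeholders (not Bing Webmaster Tools output), the function name is hypothetical, and the test treats weeks as exchangeable, which a trending or seasonal series violates.

```python
import random


def permutation_p_value(baseline, post, n_iter=10_000, seed=0):
    """One-sided permutation test on the difference in weekly means."""
    rng = random.Random(seed)  # seeded so reruns give the same answer
    observed = sum(post) / len(post) - sum(baseline) / len(baseline)
    pooled = list(baseline) + list(post)
    n_post = len(post)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        fake_post, fake_base = pooled[:n_post], pooled[n_post:]
        diff = sum(fake_post) / len(fake_post) - sum(fake_base) / len(fake_base)
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_iter + 1)


# HYPOTHETICAL weekly citation counts, invented for illustration only.
baseline_weeks = [150, 172, 161, 180, 158, 169, 175, 163, 171, 166, 160]
post_weeks = [310, 420, 380, 510, 470, 560, 610]

observed_diff, p_value = permutation_p_value(baseline_weeks, post_weeks)
print(f"mean weekly change: {observed_diff:+.1f}")
print(f"one-sided permutation p-value: {p_value:.4f}")
```

A small p-value says the shift is unlikely to be week-to-week noise. It says nothing about what caused the shift, so it complements the one-change-at-a-time rule rather than replacing it.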
Reading case studies well is the same skill as measuring your own AI visibility well: insist on a baseline, separate correlation from cause, and watch composition over time across engines. Turning scattered per-engine signals into one honest, cross-engine scoreboard, so you can test what actually moves your citations instead of trusting a headline, is exactly what Buffy Intel is built to do.