Does adding structured entity markup increase AI citations?

It might help, but no single case study proves it. One 19-week study published by waikay.io reported its weekly Bing AI citations rising from a 167 average to 614 (about 3.7×) after it added entity markup and submitted it to Bing Webmaster Tools. But the study had no control group, was a single self-reported domain, and the vendor sells the markup it measured — and roughly 75% of the growth came from non-English pages, which the authors themselves attribute to a deeper site re-crawl rather than the markup. So read it as a hypothesis worth testing on your own site, not as proven cause and effect.

What makes an AI-visibility case study trustworthy?

Four things: a control (or at least a clear baseline and a way to rule out confounding changes like a re-crawl), independence (who ran it, and do they sell what they're measuring?), composition not just totals (did the valuable citations move, or only the raw count?), and replication (has anyone reproduced it on another site?). A study can be honest and still be weak evidence if it's a single self-reported domain with no control — which describes most vendor case studies, including good ones.

Can I measure my own AI citations to test these claims?

Yes, and you should — first-party data beats a vendor's. Bing Webmaster Tools reports your AI citation activity free for verified site owners, so you can watch your own before-and-after when you change something. Pair it with your server logs (to see crawl activity) and a cross-engine view, so a single engine's numbers don't define your read. Change one thing at a time and watch a clear baseline, or you'll repeat the same confound the case studies suffer from.
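
For the server-log half of that pairing, here is a minimal sketch. It assumes access logs in the common "combined" format and that Bing's crawler identifies itself with `bingbot` in the user-agent string. The sample lines and the `weekly_bingbot_hits` helper are hypothetical; adapt the parsing to your own log format.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the date part of a combined-format timestamp,
# e.g. [01/Jun/2026:10:15:00 +0000]. The log format is an assumption.
TIMESTAMP = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}):")

def weekly_bingbot_hits(lines):
    """Count log lines that mention 'bingbot', keyed by ISO (year, week)."""
    hits = Counter()
    for line in lines:
        if "bingbot" not in line.lower():
            continue
        match = TIMESTAMP.search(line)
        if match is None:
            continue  # skip lines that don't match the assumed format
        day = datetime.strptime(match.group(1), "%d/%b/%Y")
        iso = day.isocalendar()
        hits[(iso[0], iso[1])] += 1
    return hits

# Hypothetical sample lines, not real traffic.
BINGBOT_UA = "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
sample = [
    f'203.0.113.5 - - [01/Jun/2026:10:15:00 +0000] "GET /pricing HTTP/1.1" 200 512 "-" "{BINGBOT_UA}"',
    f'203.0.113.5 - - [03/Jun/2026:11:00:00 +0000] "GET /fr/tarifs HTTP/1.1" 200 498 "-" "{BINGBOT_UA}"',
    '198.51.100.7 - - [03/Jun/2026:11:05:00 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0 (Windows NT 10.0)"',
    f'203.0.113.5 - - [09/Jun/2026:09:30:00 +0000] "GET /es/precios HTTP/1.1" 200 505 "-" "{BINGBOT_UA}"',
]

for week, count in sorted(weekly_bingbot_hits(sample).items()):
    print(week, count)
```

Crawl hits are not citations, and a user-agent string can be spoofed, so treat this as context for the citation numbers rather than a substitute for them. A jump in crawl activity in the same week as your change is exactly the kind of confound worth noticing.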

How to read an AI-visibility case study (using a 19-week Bing citation test)

When a vendor reports that some change tripled their AI citations, the right first question isn't "how?" — it's "compared to what, and how do they know it was the change?" A useful worked example landed in June 2026: a 19-week dataset, published by the brand-visibility platform waikay.io about its own site, reporting weekly Bing AI citations rising from a 167 average to 614 — about a 3.7× lift — after it added entity markup. It's a genuinely interesting result. It's also a textbook case for reading AI-visibility studies critically.

This is part of the measuring AI visibility series, and a companion to what Bing Webmaster Tools' AI data can tell you — the free, first-party source this study is built on. The goal here isn't to judge one vendor; it's a reusable lens for the flood of "we did X and citations jumped" posts.

What did the case study report?

A large headline number, and a more complicated story underneath it. Over 19 weeks of Bing Webmaster Tools data — 11 weeks of baseline, the markup submitted in week 12, then 7 weeks after — the site reported the following, all attributed to waikay.io's own June-2026 write-up:

Metric	Reported result
Weekly AI citations (avg)	167 → 614 (~3.7×)
Citations per cited page (avg)	3.65 → 8.55
Peak weekly citations	1,063 (week 19)
English-language raw volume	down ~8%
English citations by funnel stage	TOFU −80%, MOFU +44%, BOFU +406%
Share of total growth from non-English pages	~75% (French + Spanish)

The authors are commendably candid about the confounds: they note the non-English surge "likely" came from a deeper site re-crawl rather than the markup, and they list the limits themselves — a single domain, a short 7-week window, no controlled experiment. That candour is exactly what makes it a good teaching example.
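
The reported figures support a quick back-of-the-envelope check. The sketch below uses only numbers from the table above. The "lift excluding non-English growth" line is an illustrative assumption (it applies the ~75% share to the average weekly growth and sets that portion aside), not a figure from the study.

```python
# Figures as reported in the table above.
baseline_avg = 167        # weekly AI citations, baseline average
post_avg = 614            # weekly AI citations, post-change average
non_english_share = 0.75  # share of growth from French + Spanish pages

headline_lift = post_avg / baseline_avg
growth = post_avg - baseline_avg

# Assumption: apply the ~75% share to the average weekly growth and treat
# that portion as plausibly re-crawl-driven rather than markup-driven.
remaining_growth = growth * (1 - non_english_share)
lift_excluding_non_english = (baseline_avg + remaining_growth) / baseline_avg

per_page_lift = 8.55 / 3.65

print(f"headline lift: {headline_lift:.2f}x")
print(f"average weekly growth: {growth}")
print(f"lift excluding non-English growth: {lift_excluding_non_english:.2f}x")
print(f"citations per cited page: {per_page_lift:.2f}x")
```

Under that assumption, the lift not explained by the non-English surge is roughly 1.7×, not 3.7×. The study's own English-language figure (down ~8%) suggests the real picture is messier than a single split, which is one more reason to read the headline cautiously.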

The four questions to ask of any AI-visibility case study

Run every "we did X, citations rose" claim through these four. The waikay study passes some and openly fails others — which is the point.

1. Is there a control, or just a before-and-after? A baseline isn't a control. If anything else changed in the window (a re-crawl, a freshness pass, a new backlink, an algorithm update), the before-and-after can't isolate the cause. Here the authors flag that a re-crawl plausibly drove ~75% of the lift, so the headline can't be cleanly credited to the markup.
2. Who ran it, and do they sell what they measured? Independence matters. This is a vendor measuring the effect of its own product on its own site: self-reported and single-domain. That doesn't make it false; it makes it a hypothesis from an interested party, which is weaker evidence than an independent or corroborated result.
3. Did the valuable citations move, or just the raw count? Look at composition, not totals. English raw volume actually fell ~8%, while citations shifted down-funnel (commercial, bottom-of-funnel pages up sharply). Whether that's good depends on your goal, and it's invisible if you only read the 3.7× headline.
4. Has it been replicated? One site for seven weeks is an anecdote, not a pattern. A finding earns weight when it reproduces across domains and time. Until then, treat it as a prompt to run your own test.

A 3.7× lift with no control, on one self-reported domain, where the vendor sells the thing being measured and a re-crawl could explain most of it — that's a hypothesis, not a proof. Read the limitations section as carefully as the headline.

What's the durable takeaway, separate from the headline number?

Two things survive the scrutiny. First, Bing Webmaster Tools is a real, free, first-party place to watch your own AI citations move, which is why a study could be built on it at all, and why you can run the same before-and-after on your own site instead of trusting anyone's. Second, citation totals and citation composition are different metrics. A site can earn more citations per cited page, as this one did (3.65 → 8.55), even where raw volume dips, as its English-language volume did (~8%). That echoes the broader pattern that focused, well-covered pages tend to be cited more thoroughly. It's a citation-coverage story, and it's more useful than any single multiplier.

What doesn't survive is the causal leap. "We added markup and citations tripled" quietly becomes "the markup tripled citations" — and the authors' own re-crawl caveat shows why that step is unearned. The same discipline applies to the entity-strength claims that fill this space: entity signals plausibly help, but proving it needs a control, not a coincidence.
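
The difference between totals and composition is easier to see with numbers. This minimal sketch uses hypothetical weekly figures (not from the study) and a hypothetical `Week` record. The funnel labels TOFU, MOFU and BOFU follow the study's own breakdown.

```python
from dataclasses import dataclass

@dataclass
class Week:
    citations_by_stage: dict  # funnel stage -> citation count
    cited_pages: int          # distinct pages cited that week

    @property
    def total(self):
        return sum(self.citations_by_stage.values())

    @property
    def per_cited_page(self):
        return self.total / self.cited_pages

    def stage_share(self, stage):
        return self.citations_by_stage[stage] / self.total

# Hypothetical numbers, chosen only to illustrate the pattern.
before = Week({"TOFU": 120, "MOFU": 50, "BOFU": 30}, cited_pages=50)
after = Week({"TOFU": 40, "MOFU": 70, "BOFU": 80}, cited_pages=25)

print("total:", before.total, "->", after.total)
print("per cited page:", before.per_cited_page, "->", after.per_cited_page)
print(f"BOFU share: {before.stage_share('BOFU'):.0%} -> {after.stage_share('BOFU'):.0%}")
```

In this made-up example the raw total slips slightly while citations per cited page and the bottom-of-funnel share both rise. A dashboard that shows only the total would report a small decline and miss the shift entirely.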

How to test a claim like this on your own site

Don't take the case study's word; reproduce the method honestly:

1. Establish a clear baseline in Bing Webmaster Tools and your server logs before you change anything.
2. Change one thing at a time. If you ship markup and trigger a re-crawl and add content in the same week, you've rebuilt the confound you were trying to avoid.
3. Watch composition, not just the total. Track citations per page and which funnel stages move, the way the reporting stack frames it, so a flat or falling raw count doesn't hide a real shift.
4. Give it time, and expect noise. Short windows over-read random spikes; citation activity is volatile and decays without freshness.
5. Cross-check across engines. Bing is one surface; a lift there may or may not show in Google AI Mode, Perplexity, or ChatGPT.
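
The baseline and noise steps above can be sketched as follows, assuming you have copied weekly citation counts into two lists by hand. The numbers and the `summarize_change` helper are hypothetical, and the two-standard-deviation threshold is an arbitrary illustration rather than a formal statistical test. It says nothing about confounds such as a re-crawl.

```python
from statistics import mean, stdev

def summarize_change(baseline, post):
    """Compare post-change weekly counts with the baseline's own week-to-week noise."""
    base_mean = mean(baseline)
    base_sd = stdev(baseline)  # sample standard deviation of the baseline weeks
    post_mean = mean(post)
    return {
        "baseline_mean": base_mean,
        "post_mean": post_mean,
        "lift": post_mean / base_mean,
        # How many baseline standard deviations the post-change mean moved.
        "shift_in_sd": (post_mean - base_mean) / base_sd,
    }

# Hypothetical weekly citation counts (not from the study).
baseline_weeks = [150, 170, 160, 180, 165, 175, 155, 170, 160, 185, 170]
post_weeks = [170, 190, 160, 200, 175, 185, 180]

result = summarize_change(baseline_weeks, post_weeks)
print(f"lift: {result['lift']:.2f}x, shift: {result['shift_in_sd']:.1f} baseline SDs")
if result["shift_in_sd"] < 2:
    print("Within ordinary week-to-week noise; keep watching before concluding anything.")
else:
    print("Larger than baseline noise; now rule out other changes in the window.")
```

Measuring the shift against the baseline's own variability is a simple guard against over-reading a spike. Even a shift well outside the noise only tells you something changed, not what caused it.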

Reading case studies well is the same skill as measuring your own AI visibility well: insist on a baseline, separate correlation from cause, and watch composition over time across engines. Turning scattered per-engine signals into one honest, cross-engine scoreboard, so you can test what actually moves your citations instead of trusting a headline, is exactly what Buffy Intel is built to do.