Quantifying Uncertainty in AI Visibility: A Statistical Framework for Generative Search Measurement

AI-powered answer engines are inherently non-deterministic: identical queries submitted at different times can produce different responses and cite different sources. Despite this stochastic behavior, current approaches to measuring domain visibility in generative search typically rely on single-run point estimates of citation share and prevalence, implicitly treating them as fixed values. This paper argues that citation visibility metrics should instead be treated as sample estimators of an underlying response distribution. We conduct an empirical study of citation variability across three generative search platforms (Perplexity Search, OpenAI SearchGPT, and Google Gemini) using repeated sampling across three consumer product topics. Two sampling regimes are employed: daily collections over nine days and high-frequency sampling at ten-minute intervals. We show that citation distributions follow a power-law form and exhibit substantial variability across repeated samples. Bootstrap confidence intervals reveal that many apparent differences between domains fall within the noise floor of the measurement process. Distribution-wide rank stability analysis further shows that citation rankings are unstable across samples, not only among top-ranked domains but throughout the set of frequently cited domains. Together, these findings indicate that single-run visibility metrics provide a misleadingly precise picture of domain performance in generative search. We argue that citation visibility must be reported with uncertainty estimates, and we provide practical guidance on the sample sizes required to achieve interpretable confidence intervals.

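To make the idea of reporting citation share with an uncertainty estimate concrete, the following is a minimal sketch of a percentile bootstrap interval. It is an illustration under stated assumptions, not the procedure used in the study: the abstract does not specify the resampling unit, so the sketch assumes that each repeated query run is one unit and that a run is represented as a list of cited domains. The names `citation_share`, `bootstrap_ci`, and the example domains are hypothetical.

```python
import random
from collections import Counter


def citation_share(runs, domain):
    """Fraction of all citations, pooled across runs, that point to `domain`."""
    counts = Counter(d for run in runs for d in run)
    total = sum(counts.values())
    return counts[domain] / total if total else 0.0


def bootstrap_ci(runs, domain, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for citation share.

    Assumption: whole runs (responses) are resampled with replacement,
    so citations within one response stay together.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)
    n = len(runs)
    stats = sorted(
        citation_share([runs[rng.randrange(n)] for _ in range(n)], domain)
        for _ in range(n_boot)
    )
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return citation_share(runs, domain), lower, upper


# Hypothetical data: five repeated runs of the same query on one platform.
runs = [
    ["a.example", "b.example", "c.example"],
    ["a.example", "c.example"],
    ["b.example", "d.example", "a.example"],
    ["c.example", "d.example"],
    ["a.example", "b.example"],
]
point, lower, upper = bootstrap_ci(runs, "a.example")
print(f"share={point:.3f}, 95% CI=[{lower:.3f}, {upper:.3f}]")
```

With only a handful of runs the interval is typically wide, which is the practical point: two domains whose intervals overlap heavily cannot be ranked with confidence from such a sample.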