Quantifying Uncertainty in AI Visibility: A Statistical Framework for Generative Search Measurement

AI-powered answer engines are inherently non-deterministic: identical queries submitted at different times can produce different responses and cite different sources. Despite this stochastic behavior, current approaches to measuring domain visibility in generative search typically rely on single-run point estimates of citation share and prevalence, implicitly treating them as fixed values. This paper argues that citation visibility metrics should instead be treated as sample estimators of an underlying response distribution. We conduct an empirical study of citation variability across three generative search platforms (Perplexity Search, OpenAI SearchGPT, and Google Gemini) using repeated sampling across three consumer product topics. Two sampling regimes are employed: daily collections over nine days and high-frequency sampling at ten-minute intervals. We show that citation distributions follow a power-law form and exhibit substantial variability across repeated samples. Bootstrap confidence intervals reveal that many apparent differences between domains fall within the noise floor of the measurement process. Distribution-wide rank stability analysis further shows that citation rankings are unstable across samples, not only among top-ranked domains but throughout the set of frequently cited domains. These findings indicate that single-run visibility metrics provide a misleadingly precise picture of domain performance in generative search. We argue that citation visibility must be reported with uncertainty estimates, and we provide practical guidance on the sample sizes required to achieve interpretable confidence intervals.
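
To make the idea of reporting citation share with an uncertainty estimate concrete, the following is a minimal sketch of a percentile bootstrap confidence interval. It is an illustration under stated assumptions, not the paper's actual procedure: each repeated query run is assumed to be a list of cited domains, whole runs are assumed to be the resampling unit, and the function names, the example domains, and the default of 2000 resamples are all hypothetical.

```python
import random


def citation_share(runs, domain):
    """Fraction of all citations, pooled across runs, that point to `domain`."""
    total = sum(len(run) for run in runs)
    hits = sum(run.count(domain) for run in runs)
    return hits / total if total else 0.0


def bootstrap_ci(runs, domain, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for citation share.

    Assumption: whole runs (responses) are resampled with replacement,
    so citations within one response stay together.
    """
    rng = random.Random(seed)
    n = len(runs)
    stats = []
    for _ in range(n_boot):
        resample = [runs[rng.randrange(n)] for _ in range(n)]
        stats.append(citation_share(resample, domain))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Hypothetical data: five repeated runs of one query, each a list of cited domains.
runs = [
    ["a.com", "b.com", "c.com"],
    ["a.com", "c.com"],
    ["b.com", "a.com", "d.com"],
    ["c.com", "d.com"],
    ["a.com", "b.com"],
]

point = citation_share(runs, "a.com")
lo, hi = bootstrap_ci(runs, "a.com")
print(f"share={point:.3f}, 95% CI=[{lo:.3f}, {hi:.3f}]")
```

With only a handful of runs the interval is wide, which is the point the abstract makes: two domains whose intervals overlap heavily cannot be reliably ranked from a single run.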
