The Surprising Universality of LLM Outputs: A Real-Time Verification Primitive

We report a striking statistical regularity in frontier LLM outputs that enables a CPU-only scoring primitive running at 2.6 microseconds per token, with estimated latency up to five orders of magnitude (100,000$\times$) lower than that of existing sampling-based detectors. Across six contemporary models from five independent vendors, two generation sizes, and five held-out domains, token rank-frequency distributions converge to the same two-parameter Mandelbrot ranking distribution: 34 of 36 model-by-domain fits exceed $R^{2} = 0.94$, and 35 of 36 favor Mandelbrot over Zipf by AIC. The shared family does not collapse the models into statistical duplicates. Fitted Mandelbrot parameters remain cleanly separable between models: the cross-model spread in $q$ (1.63 to 3.69) exceeds its per-model bootstrap standard deviation (0.03 to 0.10) by more than an order of magnitude, so a few thousand output tokens suffice to separate models by tens of standard deviations. Two capabilities follow. The first is statistical model fingerprinting: text from a vendor-delivered LLM can be tested against its claimed model family without cryptographic watermarks or access to model internals, supporting provenance verification and silent-substitution audits. The second is black-box output assessment: the shared family supplies a model-agnostic reference distribution, from which we derive a single-pass scoring primitive that composes with model log probabilities when available and degrades to a rank-only mode usable on closed APIs. Pilot results on FRANK, TruthfulQA, and HaluEval map where the primitive helps (lexical anomalies, unsupported entities) and where it structurally cannot (reasoning errors expressed in domain-appropriate vocabulary). We position the primitive as a first-pass triage layer in compound evaluation stacks, not as a replacement for sampling-based or source-conditioned verifiers.
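As a concrete illustration of the fitting step, the sketch below fits the Zipf-Mandelbrot form $f(r) = C\,(r+q)^{-s}$ to a rank-frequency list and compares it with plain Zipf ($q = 0$) by AIC. This is a minimal sketch under stated assumptions, not the fitting procedure used for the reported results: the log-space least-squares objective, the grid search over $q$, the exponent name `s`, the Gaussian-residual AIC, and all function names are illustrative choices.

```python
import math
from collections import Counter


def rank_frequencies(tokens):
    """Relative token frequencies, sorted from most to least common."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    total = sum(counts)
    return [c / total for c in counts]


def _linfit(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, rss)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, rss


def fit_mandelbrot(freqs, q_grid=None):
    """Fit log f(r) = log C - s*log(r + q) by grid search over q.

    Assumption: log-space least squares with a fixed q grid (0 to 10 in
    steps of 0.05). Passing q_grid=[0.0] gives a plain Zipf fit.
    """
    if len(freqs) < 3:
        raise ValueError("need at least three ranks to fit")
    if q_grid is None:
        q_grid = [i * 0.05 for i in range(201)]
    ys = [math.log(f) for f in freqs]
    my = sum(ys) / len(ys)
    tss = sum((y - my) ** 2 for y in ys)
    best = None
    for q in q_grid:
        xs = [math.log(r + q) for r in range(1, len(freqs) + 1)]
        a, b, rss = _linfit(xs, ys)
        if best is None or rss < best["rss"]:
            best = {"q": q, "s": -b, "log_c": a, "rss": rss}
    best["r2"] = 1.0 - best["rss"] / tss if tss > 0 else float("nan")
    return best


def aic(rss, n, k):
    """AIC under Gaussian residuals: n*ln(RSS/n) + 2k (lower is better)."""
    return n * math.log(max(rss, 1e-300) / n) + 2 * k


# Synthetic check: frequencies drawn exactly from a Mandelbrot law.
raw = [(r + 2.5) ** -1.2 for r in range(1, 501)]
freqs = [x / sum(raw) for x in raw]

mandelbrot = fit_mandelbrot(freqs)
zipf = fit_mandelbrot(freqs, q_grid=[0.0])

# Free parameters counted here: Zipf (log C, s) = 2; Mandelbrot adds q = 3.
aic_mandelbrot = aic(mandelbrot["rss"], len(freqs), k=3)
aic_zipf = aic(zipf["rss"], len(freqs), k=2)
print(mandelbrot["q"], mandelbrot["s"], aic_mandelbrot < aic_zipf)
```

On real text, `rank_frequencies` would be applied to a tokenized model output, and a bootstrap over resampled tokens would give the per-model spread in $q$ that the separability claim relies on. A grid search is used here only because it keeps the sketch dependency-free; a continuous optimizer would be the more usual choice.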
