Mind the Unseen Mass: Unmasking LLM Hallucinations via Soft-Hybrid Alphabet Estimation

This paper studies uncertainty quantification for large language models (LLMs) under black-box access, where only a small number of responses can be sampled for each query. In this setting, estimating the effective semantic alphabet size (that is, the number of distinct meanings expressed in the sampled responses) provides a useful proxy for downstream risk. However, frequency-based estimators tend to undercount rare semantic modes when the sample size is small, while graph-spectral quantities alone are not designed to estimate semantic occupancy accurately. To address this issue, we propose SHADE (Soft-Hybrid Alphabet Dynamic Estimator), a simple and interpretable estimator that combines Generalized Good-Turing coverage with a heat-kernel trace of the normalized Laplacian constructed from an entailment-weighted graph over sampled responses. The estimated coverage adaptively determines the fusion rule: under high coverage, SHADE uses a convex combination of the two signals, while under low coverage it applies a LogSumExp fusion to emphasize missing or weakly observed semantic modes. A finite-sample correction is then introduced to stabilize the resulting cardinality estimate before converting it into a coverage-adjusted semantic entropy score. Experiments on pooled semantic alphabet-size estimation against large-sample references and on QA incorrectness detection show that SHADE achieves its strongest improvements in the most sample-limited regime, while the performance gap narrows as the number of samples increases. These results suggest that hybrid semantic occupancy estimation is particularly beneficial when black-box uncertainty quantification must operate under tight sampling budgets.

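The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so several choices below are assumptions made only for illustration. These include using the classic Good-Turing coverage (1 minus the singleton fraction) in place of the generalized version, the diffusion time `t`, the coverage `threshold`, the LogSumExp temperature `beta`, the use of the coverage itself as the convex weight, and the guard against zero coverage. All function names are hypothetical. The finite-sample correction and the entropy conversion are omitted because the abstract does not specify them.

```python
import numpy as np

def good_turing_coverage(cluster_counts):
    """Classic Good-Turing coverage: 1 - (#clusters seen exactly once) / n.
    Assumption: stands in for the paper's generalized variant."""
    counts = np.asarray(cluster_counts, dtype=float)
    n = counts.sum()
    return 1.0 - np.sum(counts == 1) / n

def frequency_estimate(cluster_counts):
    """Observed cluster count inflated by estimated coverage.
    The 1/n floor is an ad hoc guard against division by zero (assumption)."""
    counts = np.asarray(cluster_counts, dtype=float)
    n = counts.sum()
    coverage = good_turing_coverage(counts)
    return len(counts) / max(coverage, 1.0 / n)

def heat_kernel_trace(weights, t=1.0):
    """Trace of exp(-t L), with L the symmetric normalized Laplacian of a
    weighted graph. Isolated nodes get eigenvalue 0, so for large t the
    trace approaches the number of connected components."""
    w = np.asarray(weights, dtype=float)
    w = 0.5 * (w + w.T)            # symmetrize entailment scores
    np.fill_diagonal(w, 0.0)       # ignore self-loops
    deg = w.sum(axis=1)
    safe_deg = np.where(deg > 0, deg, 1.0)
    inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(safe_deg), 0.0)
    lap = np.diag((deg > 0).astype(float)) - inv_sqrt[:, None] * w * inv_sqrt[None, :]
    eigvals = np.linalg.eigvalsh(lap)
    return float(np.sum(np.exp(-t * eigvals)))

def fuse(freq_est, spectral_est, coverage, threshold=0.5, beta=1.0):
    """Coverage-gated fusion (threshold, beta, and weights are assumptions).
    High coverage: convex combination weighted by the coverage.
    Low coverage: LogSumExp, a smooth upper bound on the maximum."""
    if coverage >= threshold:
        return coverage * freq_est + (1.0 - coverage) * spectral_est
    x = beta * np.array([freq_est, spectral_est], dtype=float)
    m = x.max()
    return float((m + np.log(np.exp(x - m).sum())) / beta)

# Toy example: 5 responses in 3 semantic clusters with sizes 3, 1, 1,
# and a 4-node graph made of two disconnected, fully entailing pairs.
counts = [3, 1, 1]
graph = np.array([[0, 1, 0, 0],
                  [1, 0, 0, 0],
                  [0, 0, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
coverage = good_turing_coverage(counts)
estimate = fuse(frequency_estimate(counts), heat_kernel_trace(graph), coverage)
```

In this sketch the LogSumExp branch never returns less than the larger of the two signals, which matches the abstract's stated intent of emphasizing missing or weakly observed modes when coverage is low. The convex branch instead interpolates between the two signals.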