Deploying large language models (LLMs) in high-stakes domains requires rigorous uncertainty quantification, yet standard softmax probabilities are often poorly calibrated. We present a systematic study of Adaptive Prediction Sets (APS) applied to next-token prediction in transformer-based models with large vocabularies (greater than 250,000 tokens). Our central contribution is the identification of a coverage-efficiency tradeoff: while naive conformal prediction achieves valid coverage, it produces prediction sets of hundreds of tokens, rendering them uninformative. We propose Vocabulary-Aware Conformal Prediction (VACP), a framework that leverages semantic masking and temperature-adjusted scoring to reduce the effective prediction space while provably maintaining marginal coverage. Experiments on Gemma-2B using the SQuAD and WikiText benchmarks demonstrate that VACP achieves 89.7 percent empirical coverage (90 percent target) while reducing the mean prediction set size from 847 tokens to 4.3 tokens, a 197x improvement in efficiency. We provide a theoretical analysis of vocabulary reduction and release our implementation for reproducibility.
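To make the APS baseline concrete, the following is a minimal sketch of split-conformal APS for next-token prediction: calibration scores are the cumulative softmax mass through the true token, a quantile of those scores gives a threshold, and each prediction set is the smallest set of top-ranked tokens whose cumulative mass reaches that threshold. This is the naive baseline the abstract contrasts against, not the VACP method; the function names and the toy three-token vocabulary are illustrative.

```python
import numpy as np

def aps_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate an APS score threshold from softmax rows and true-token indices."""
    n = len(cal_labels)
    scores = []
    for p, y in zip(cal_probs, cal_labels):
        order = np.argsort(-p)                     # tokens sorted by descending probability
        rank = int(np.where(order == y)[0][0])     # rank of the true token
        scores.append(p[order][:rank + 1].sum())   # cumulative mass through the true token
    # finite-sample-corrected quantile level, clipped to 1 for small n
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level, method="higher")

def aps_set(p, q):
    """Smallest set of top-probability tokens whose cumulative mass reaches q."""
    order = np.argsort(-p)
    cum = np.cumsum(p[order])
    k = int(np.searchsorted(cum, q)) + 1
    return order[:k]
```

With a 250,000-token vocabulary and a flat tail distribution, `aps_set` returns the large sets described above; shrinking them while keeping the marginal coverage guarantee is the problem VACP targets.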