Large language models (LLMs) operate over a large generation space, yet conventional usage typically reports only the most likely generation (MLG) as a point prediction. This underestimates the model's capability: even when the top-ranked response is incorrect, valid answers may still exist in the broader output space and can potentially be discovered through repeated sampling. This observation motivates moving from point prediction to set-valued prediction, where the model produces a set of candidate responses rather than a single MLG. In this paper, we propose a principled framework for set-valued prediction that provides feasibility-aware coverage guarantees. We show that, because LLM generation relies on a finite number of samples, coverage is not always achievable: even with multiple samples, an LLM may fail to yield an acceptable response to certain questions within the sampled candidate set. To address this, we establish a minimum achievable risk level (MRL), below which statistical coverage guarantees cannot be satisfied. Building on this insight, we develop a data-driven calibration procedure that constructs prediction sets from sampled responses by estimating a statistically rigorous threshold, ensuring that the resulting set contains a correct answer with the desired probability whenever the target risk level is feasible. Extensive experiments on six language generation tasks with five LLMs demonstrate both the statistical validity and the predictive efficiency of our framework.
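The abstract does not spell out the calibration procedure, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It substitutes standard split conformal calibration as a stand-in: each calibration question contributes a nonconformity score (here assumed to be one minus the highest confidence among its correct sampled responses, or infinity when no sampled response is correct), and the threshold is a finite-sample-corrected quantile of those scores. Under this stand-in, the smallest risk level that still yields a finite threshold is `(failures + 1) / (n + 1)`, which plays the role of a minimum achievable risk level. All function names, the confidence scores, and the data layout are hypothetical.

```python
import math


def min_risk_level(calib):
    """Smallest risk level with a finite threshold under this sketch.

    calib: list of questions, each a list of (score, is_correct) pairs,
    one pair per sampled response (hypothetical layout).
    """
    n = len(calib)
    # Questions for which none of the sampled responses is correct.
    failures = sum(1 for responses in calib
                   if not any(ok for _, ok in responses))
    return (failures + 1) / (n + 1)


def calibrate_threshold(calib, alpha):
    """Return a nonconformity threshold, or None if alpha is infeasible."""
    n = len(calib)
    # Nonconformity: 1 - best score of a correct response; inf if none exists.
    scores = sorted(
        min((1.0 - s for s, ok in responses if ok), default=math.inf)
        for responses in calib
    )
    # Finite-sample-corrected rank; the small offset guards against
    # floating-point round-up when (n + 1) * (1 - alpha) is an integer.
    rank = math.ceil((n + 1) * (1 - alpha) - 1e-9)
    if rank > n or math.isinf(scores[rank - 1]):
        return None  # target risk level is below what sampling can support
    return scores[rank - 1]


def prediction_set(responses, threshold):
    """Keep sampled responses whose nonconformity is within the threshold.

    responses: list of (text, score) pairs for a new question.
    """
    return [text for text, s in responses if 1.0 - s <= threshold]
```

A caller would first check the requested risk level against `min_risk_level` (or test for a `None` threshold) and only then build prediction sets for new questions. The marginal coverage argument behind this stand-in relies on the usual exchangeability assumption between calibration and test questions.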