Although generative models, especially large language models (LLMs), are now ubiquitous, principled mechanisms for assessing their (in)correctness remain limited. Using the conformal prediction framework, previous works construct sets of LLM responses in which the probability of including an incorrect response, or error, is capped at a user-defined tolerance level. However, because these methods are based on p-values, they are susceptible to p-hacking: choosing the tolerance level post hoc can invalidate the guarantees. We therefore leverage e-values to complement generative model outputs with e-scores as measures of incorrectness. In addition to retaining the earlier guarantees, e-scores give users the flexibility to choose data-dependent tolerance levels while upper bounding size distortion, a post-hoc notion of error. We experimentally demonstrate their efficacy in assessing LLM outputs under different forms of correctness: mathematical factuality and property constraint satisfaction.
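To make the contrast between fixed and data-dependent tolerance levels concrete, the following is a brief sketch of the standard e-value argument. It is not the specific e-score construction referred to above. The symbols $E$ (a generic e-value), $\alpha$ (a fixed tolerance level), and $\hat{\alpha}$ (a tolerance level chosen after seeing the data) are notation assumed here for illustration, and the last line is one common way to formalize a bound on the ratio of error to tolerance.

```latex
% Assumed notation: E is a generic e-value, i.e. a nonnegative random
% variable whose expectation is at most 1 under the null hypothesis.
\[
  E \ge 0, \qquad \mathbb{E}[E] \le 1 .
\]
% Fixed tolerance level: Markov's inequality bounds the error probability.
\[
  \Pr\!\left(E \ge \tfrac{1}{\alpha}\right) \le \alpha\,\mathbb{E}[E] \le \alpha
  \qquad \text{for any fixed } \alpha \in (0,1).
\]
% Data-dependent tolerance level \hat{\alpha} > 0: the indicator scaled by
% 1/\hat{\alpha} never exceeds E itself, so the expected ratio of error to
% tolerance stays bounded by 1 however \hat{\alpha} is chosen.
\[
  \frac{\mathbf{1}\{E \ge 1/\hat{\alpha}\}}{\hat{\alpha}} \le E
  \quad\Longrightarrow\quad
  \mathbb{E}\!\left[\frac{\mathbf{1}\{E \ge 1/\hat{\alpha}\}}{\hat{\alpha}}\right]
  \le \mathbb{E}[E] \le 1 .
\]
```

The pointwise inequality holds because the left-hand side equals $1/\hat{\alpha} \le E$ when the event occurs and $0 \le E$ otherwise. A p-value threshold offers no analogous pointwise bound, which is why choosing the level post hoc can break p-value-based guarantees.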