As LLMs are deployed in high-stakes settings, users must judge the correctness of individual responses, often relying on model-generated justifications such as reasoning chains or explanations. Yet no standard measure exists for whether these justifications help users distinguish correct answers from incorrect ones. We formalize this idea as error verifiability and propose $v_{\text{bal}}$, a balanced metric that measures whether justifications enable raters to accurately assess answer correctness; we validate it against human raters, who show high agreement. We find that neither common approaches, such as post-training and model scaling, nor more targeted, recommended interventions improve verifiability. We introduce two methods that do improve it: reflect-and-rephrase (RR) for mathematical reasoning and oracle-rephrase (OR) for factual QA, both of which incorporate domain-appropriate external information. Together, our results establish error verifiability as a distinct dimension of response quality that does not emerge from accuracy improvements alone and instead requires dedicated, domain-aware methods.
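One natural reading of the balanced metric, offered here only as an illustrative sketch since the abstract does not spell out its exact form, is the balanced accuracy of rater verdicts over correct and incorrect answers $a$:
\[
v_{\text{bal}} \;=\; \tfrac{1}{2}\Bigl(\Pr[\text{rater accepts } a \mid a \text{ correct}] \;+\; \Pr[\text{rater rejects } a \mid a \text{ incorrect}]\Bigr),
\]
so that a justification scores highly only when it helps raters both confirm correct answers and catch incorrect ones.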