DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models

Evaluating the quality and variability of text generated by Large Language Models (LLMs) remains a significant, unresolved research challenge. Traditional evaluation methods, such as ROUGE and BERTScore, measure token similarity and often fail to capture holistic semantic equivalence. This results in a low correlation with human judgments and intuition, which is especially problematic in high-stakes applications like healthcare and finance, where reliability, safety, and robust decision-making are critical. This work proposes DCR, an automated framework for evaluating and improving the consistency of LLM-generated texts using a divide-conquer-reasoning approach. Unlike existing LLM-based evaluators that operate at the paragraph level, our method employs a divide-and-conquer evaluator (DCE) that breaks down the paragraph-to-paragraph comparison between two generated responses into individual sentence-to-paragraph comparisons, each evaluated against predefined criteria. To facilitate this approach, we introduce an automatic metric converter (AMC) that translates the output of DCE into an interpretable numeric score. Beyond consistency evaluation, we further present a reason-assisted improver (RAI) that uses the reasons and explanations identified by DCE to generate new responses aimed at reducing these inconsistencies. Through comprehensive and systematic empirical analysis, we show that our approach outperforms state-of-the-art methods by a large margin (e.g., +19.3% and +24.3% on the SummEval dataset) in evaluating the consistency of LLM generation across multiple benchmarks covering semantic, factual, and summarization consistency tasks. Our approach also reduces output inconsistencies by nearly 90%, showing promise for effective hallucination mitigation.
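The divide-and-conquer idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Verdict`, `divide_conquer_evaluate`, `to_score`, and `toy_judge` are hypothetical, the scoring rule (fraction of sentences judged consistent) is an assumption chosen for illustration, and `toy_judge` is a simple word-overlap stand-in for what would be an LLM call in the actual framework.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical judge signature: (sentence, reference_paragraph) -> (consistent, reason).
# In DCR this role is played by an LLM; here any callable can be plugged in.
Judge = Callable[[str, str], Tuple[bool, str]]


@dataclass
class Verdict:
    sentence: str
    consistent: bool
    reason: str


def split_sentences(paragraph: str) -> List[str]:
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def divide_conquer_evaluate(candidate: str, reference: str, judge: Judge) -> List[Verdict]:
    # Divide: one sentence of the candidate at a time.
    # Conquer: compare that sentence against the whole reference paragraph.
    verdicts = []
    for sentence in split_sentences(candidate):
        consistent, reason = judge(sentence, reference)
        verdicts.append(Verdict(sentence, consistent, reason))
    return verdicts


def to_score(verdicts: List[Verdict]) -> float:
    # Assumed metric conversion: fraction of sentences judged consistent, in [0, 1].
    if not verdicts:
        return 0.0
    return sum(v.consistent for v in verdicts) / len(verdicts)


def toy_judge(sentence: str, reference: str) -> Tuple[bool, str]:
    # Stand-in for an LLM judge: a sentence counts as consistent only if
    # every one of its words also appears in the reference paragraph.
    words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    ref_words = set(re.findall(r"[a-z0-9]+", reference.lower()))
    missing = sorted(words - ref_words)
    if missing:
        return False, "not supported by reference: " + ", ".join(missing)
    return True, "all words appear in reference"


reference = "The cat sat on the mat. It was sunny."
candidate = "The cat sat on the mat. The dog barked."
verdicts = divide_conquer_evaluate(candidate, reference, toy_judge)
for v in verdicts:
    print(v.consistent, "|", v.sentence, "|", v.reason)
print(to_score(verdicts))  # prints 0.5
```

The per-sentence reasons collected in each `Verdict` are the kind of signal that a reason-assisted improver could feed back into a regeneration prompt; that step is omitted here because the abstract does not specify how the prompt is built.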

