Evaluating the quality and variability of text generated by Large Language Models (LLMs) remains a significant and unresolved research challenge. Traditional evaluation metrics such as ROUGE and BERTScore measure token similarity and often fail to capture holistic semantic equivalence. This results in low correlation with human judgments and intuition, which is especially problematic in high-stakes applications such as healthcare and finance, where reliability, safety, and robust decision-making are critical. This work proposes DCR, an automated framework for evaluating and improving the consistency of LLM-generated texts using a divide-conquer-reasoning approach. Unlike existing LLM-based evaluators that operate at the paragraph level, our method employs a divide-and-conquer evaluator (DCE) that breaks down the paragraph-to-paragraph comparison between two generated responses into individual sentence-to-paragraph comparisons, each evaluated against predefined criteria. To facilitate this approach, we introduce an automatic metric converter (AMC) that translates the output of DCE into an interpretable numeric score. Beyond consistency evaluation, we further present a reason-assisted improver (RAI) that uses the reasons and explanations produced by DCE to generate new responses that reduce the identified inconsistencies. Through comprehensive and systematic empirical analysis, we show that our approach outperforms state-of-the-art methods by a large margin (e.g., +19.3% and +24.3% on the SummEval dataset) in evaluating the consistency of LLM generation across multiple benchmarks covering semantic, factual, and summarization consistency tasks. Our approach also reduces output inconsistencies by nearly 90%, showing promise for effective hallucination mitigation.
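As a rough illustration of the divide-and-conquer idea described in the abstract, the sketch below splits a candidate paragraph into sentences, checks each sentence against the full reference paragraph, and converts the per-sentence verdicts into a single number. It is not the DCR implementation: the names `split_sentences`, `divide_conquer_evaluate`, `to_score`, and `toy_judge` are hypothetical, the sentence splitter is naive, the score (fraction of sentences judged consistent) is an assumed stand-in for AMC, and `toy_judge` is a simple word-overlap placeholder where DCE would call an LLM with predefined criteria.

```python
import re
from typing import Callable, List, Tuple

# (is_consistent, reason) for one sentence-to-paragraph comparison
Verdict = Tuple[bool, str]
Judge = Callable[[str, str], Verdict]


def split_sentences(paragraph: str) -> List[str]:
    """Naive splitter (assumption: '.', '!' or '?' followed by whitespace ends a sentence)."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def divide_conquer_evaluate(candidate: str, reference: str, judge: Judge) -> List[Verdict]:
    """Compare each candidate sentence against the whole reference paragraph."""
    return [judge(sentence, reference) for sentence in split_sentences(candidate)]


def to_score(verdicts: List[Verdict]) -> float:
    """Assumed metric: fraction of sentences judged consistent, in [0, 1]."""
    if not verdicts:
        raise ValueError("no sentences to score")
    return sum(1 for ok, _ in verdicts if ok) / len(verdicts)


def toy_judge(sentence: str, reference: str) -> Verdict:
    """Placeholder for an LLM call: consistent only if every word occurs in the reference."""
    words = set(re.findall(r"\w+", sentence.lower()))
    ref_words = set(re.findall(r"\w+", reference.lower()))
    missing = sorted(words - ref_words)
    if missing:
        return False, f"not supported by reference: {missing}"
    return True, "all words found in reference"


if __name__ == "__main__":
    reference = "The cat sat on the mat. It was sunny."
    candidate = "The cat sat on the mat. It was raining."
    verdicts = divide_conquer_evaluate(candidate, reference, toy_judge)
    for ok, reason in verdicts:
        print(ok, reason)
    print("score:", to_score(verdicts))
```

The per-sentence reasons returned alongside each verdict are the kind of signal that a reason-assisted improver could consume when regenerating a response; that step is omitted here because it depends on a generation model.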