We present TIGERScore, a \textbf{T}rained metric that follows \textbf{I}nstruction \textbf{G}uidance to perform \textbf{E}xplainable and \textbf{R}eference-free evaluation over a wide spectrum of text generation tasks. Unlike other automatic evaluation methods, which provide only arcane scores, TIGERScore is guided by natural language instructions to produce an error analysis that pinpoints the mistakes in the generated text. Our metric is based on LLaMA-2 and trained on MetricInstruct, our meticulously curated instruction-tuning dataset, which covers 6 text generation tasks and 23 text generation datasets. The dataset consists of 42K quadruples of the form (instruction, input, system output $\rightarrow$ error analysis). We collected the `system outputs' from a large variety of models to cover different types of errors. To assess our metric quantitatively, we evaluate its correlation with human ratings on 5 held-in datasets and 2 held-out datasets, and show that TIGERScore achieves the open-source SoTA correlation with human ratings across these datasets and approaches that of the GPT-4 evaluator. Although it is a reference-free metric, its correlation can even surpass that of the best existing reference-based metrics. To assess the rationales generated by our metric qualitatively, we conduct a human evaluation of the generated explanations and find that they are 70.8\% accurate. Based on these experimental results, we believe TIGERScore demonstrates the possibility of building universal explainable metrics to evaluate any text generation task.