We present TIGERScore, a \textbf{T}rained metric that follows \textbf{I}nstruction \textbf{G}uidance to perform \textbf{E}xplainable and \textbf{R}eference-free evaluation over a wide spectrum of text generation tasks. Unlike other automatic evaluation methods, which provide only arcane scores, TIGERScore is guided by natural language instructions to produce an error analysis that pinpoints the mistakes in the generated text. Our metric is based on LLaMA-2 and is trained on MetricInstruct, our meticulously curated instruction-tuning dataset, which covers 6 text generation tasks and 23 text generation datasets. The dataset consists of 42K quadruples of the form (instruction, input, system output $\rightarrow$ error analysis). We collected the `system outputs' from a large variety of models to cover different types of errors. To assess our metric quantitatively, we evaluate its correlation with human ratings on 5 held-in datasets and 2 held-out datasets, and show that TIGERScore achieves the highest correlation with human ratings among open-source metrics across these datasets, coming close to that of a GPT-4 evaluator. Although it is a reference-free metric, its correlation can even surpass that of the best existing reference-based metrics. To assess the rationales generated by our metric qualitatively, we conduct a human evaluation of the generated explanations and find that they are 70.8\% accurate. Based on these experimental results, we believe TIGERScore demonstrates the possibility of building universal explainable metrics to evaluate any text generation task.