Traditionally, discriminative models have been the predominant choice for tasks such as document classification and information extraction. These models produce predictions from a limited set of predefined classes, which permits a binary true-or-false evaluation and the direct calculation of metrics such as the F1 score. Recent advances in generative large language models (GLLMs), however, have prompted a shift in the field due to their enhanced zero-shot capabilities, which eliminate the need for a downstream dataset and computationally expensive fine-tuning. Evaluating GLLMs is nevertheless challenging, because the binary true-or-false evaluation used for discriminative models does not apply to their free-form predictions. This paper introduces ANLS*, a new metric for generative models that covers a wide variety of tasks, including information extraction and classification. ANLS* extends existing ANLS metrics as a drop-in replacement and remains compatible with previously reported ANLS scores. We also provide an evaluation of 7 different datasets and 3 different GLLMs using the ANLS* metric, demonstrating the importance of the proposed metric. In addition, we benchmark a novel approach to generating prompts for documents, called SFT, against other prompting techniques such as LATIN. In 15 out of 21 cases, SFT outperforms the other techniques and improves the state of the art, sometimes by as much as $15$ percentage points. Sources are available at https://github.com/deepopinion/anls_star_metric
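For context, the classic per-answer ANLS score that ANLS* builds on can be sketched as follows. This is a minimal illustration under the original ANLS definition (normalized Levenshtein similarity, zeroed below the conventional 0.5 threshold), not the authors' ANLS* implementation, which further generalizes the comparison beyond plain strings; function names here are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls_score(prediction: str, ground_truth: str, threshold: float = 0.5) -> float:
    """Per-answer ANLS: 1 - normalized edit distance, cut off at `threshold`."""
    if not prediction and not ground_truth:
        return 1.0  # both empty: perfect match
    nls = 1.0 - levenshtein(prediction, ground_truth) / max(len(prediction), len(ground_truth), 1)
    return nls if nls >= threshold else 0.0
```

The threshold makes the metric forgiving of minor OCR-style misspellings while scoring clearly wrong answers as 0, which is what allows ANLS* to remain compatible with previously reported ANLS numbers.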