Traditionally, discriminative models have been the predominant choice for tasks such as document classification and information extraction. These models make predictions that fall into a limited number of predefined classes, which allows each prediction to be judged as simply true or false and enables the direct calculation of metrics such as the F1 score. Recent advances in generative large language models (GLLMs) have prompted a shift in the field, because their enhanced zero-shot capabilities eliminate the need for a downstream dataset and computationally expensive fine-tuning. Evaluating GLLMs is challenging, however, since the binary true-or-false evaluation used for discriminative models is not applicable to the free-form predictions that GLLMs produce. This paper introduces a new metric for generative models, called ANLS*, for evaluating a wide variety of tasks, including information extraction and classification. The ANLS* metric extends existing ANLS metrics as a drop-in replacement and remains compatible with previously reported ANLS scores. We also provide an evaluation of 7 different datasets and more than 20 different GLLMs, combined with 3 different prompting methods, using the ANLS* metric, demonstrating the importance of the proposed metric. In addition, we benchmark a novel approach to generating prompts for documents, called SFT, against other prompting techniques such as LATIN. In almost all cases, SFT outperforms the other techniques and improves the state of the art, sometimes by as much as $10$ percentage points. Sources are available at https://github.com/deepopinion/anls_star_metric
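To make the contrast with binary true-or-false scoring concrete, the following is a minimal sketch of the classical string-level ANLS score that ANLS* is said to extend. It is not the ANLS* implementation from the repository above. The function names (`levenshtein`, `nls`, `anls`), the 0.5 threshold, and the lowercasing and whitespace stripping are assumptions based on how ANLS is commonly defined for document question answering.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def nls(ground_truth: str, prediction: str, threshold: float = 0.5) -> float:
    """Normalized Levenshtein similarity in [0, 1], cut to 0 at the threshold.

    Assumption: comparison is case-insensitive and ignores outer whitespace.
    """
    gt, pred = ground_truth.strip().lower(), prediction.strip().lower()
    longest = max(len(gt), len(pred))
    if longest == 0:
        return 1.0  # two empty strings are treated as a perfect match
    distance = levenshtein(gt, pred) / longest
    return 1.0 - distance if distance < threshold else 0.0


def anls(ground_truths: List[List[str]], predictions: List[str],
         threshold: float = 0.5) -> float:
    """Average over samples of the best similarity to any accepted answer."""
    scores = [max(nls(gt, pred, threshold) for gt in accepted)
              for accepted, pred in zip(ground_truths, predictions)]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # A single missing character still earns partial credit (1 - 1/11).
    print(round(nls("Hello World", "Hello Wrld"), 3))
    # A completely different answer falls past the threshold and scores 0.
    print(nls("abc", "xyz"))
```

Under these assumptions, a near-miss such as a one-character typo keeps most of its credit, while an unrelated answer scores zero, which is the behavior that an exact-match or F1-style evaluation cannot express for free-form generated text.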