Modern large language models (LLMs) perform remarkably well on a range of tasks that require sophisticated cognitive behaviors. Paradoxically, they appear to underperform on seemingly elementary tasks such as relation extraction and event extraction. We attribute this discrepancy to two issues in conventional evaluation: (1) existing evaluation metrics are imprecise and struggle to gauge the semantic consistency between model outputs and the ground truth, and (2) evaluation benchmarks are inherently incomplete, primarily because of restrictive human annotation schemas, so LLM performance is underestimated. Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score. The method uses LLMs fine-tuned on subjective question correction data to refine the matching between model outputs and golden labels. In addition, by incorporating a Natural Language Inference (NLI) model, SQC-Score enriches the golden labels, addressing benchmark incompleteness by crediting correct answers that the annotations omitted. Results on three information extraction tasks show that human annotators prefer SQC-Score over the baseline metrics. Using SQC-Score, we conduct a comprehensive evaluation of state-of-the-art LLMs and provide insights for future research on information extraction. The dataset and associated code are available at https://github.com/THU-KEG/SQC-Score.
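The two ideas described above, soft matching of outputs against golden labels and enrichment of the golden labels with entailed answers, can be illustrated with a minimal sketch. This is not the SQC-Score implementation: the function `enriched_f1`, its parameters, the enrichment rule, and the way precision and recall are aggregated are all assumptions made for illustration. The fine-tuned LLM matcher and the NLI model are replaced by toy stand-ins (`toy_match_score`, `toy_entails`) so that the example runs without any model.

```python
from typing import Callable, Dict, List


def enriched_f1(
    predictions: List[str],
    gold: List[str],
    source_text: str,
    match_score: Callable[[str, str], float],
    entails: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Hypothetical scoring routine (an assumption, not the paper's formula).

    match_score(prediction, label) returns a value in [0, 1]; in the paper's
    setting this role is played by a fine-tuned LLM.
    entails(premise, hypothesis) returns True when the premise supports the
    hypothesis; in the paper's setting this role is played by an NLI model.
    """
    # Step 1 (assumed rule): add to the gold set every prediction that is not
    # already fully matched by a gold label but is entailed by the source text.
    enriched = list(gold)
    for pred in predictions:
        best = max((match_score(pred, g) for g in gold), default=0.0)
        if best < 1.0 and entails(source_text, pred):
            enriched.append(pred)

    # Step 2: soft precision is the mean best-match score of each prediction.
    precision = (
        sum(max((match_score(p, g) for g in enriched), default=0.0)
            for p in predictions) / len(predictions)
        if predictions else 0.0
    )
    # Step 3: soft recall is the mean best-match score of each enriched label.
    recall = (
        sum(max((match_score(p, g) for p in predictions), default=0.0)
            for g in enriched) / len(enriched)
        if enriched else 0.0
    )
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy stand-ins so the sketch runs without models (assumptions, not real models).
def toy_match_score(prediction: str, label: str) -> float:
    # Case-insensitive exact match instead of an LLM-based judgment.
    return 1.0 if prediction.strip().lower() == label.strip().lower() else 0.0


def toy_entails(premise: str, hypothesis: str) -> bool:
    # Substring containment instead of an NLI model.
    return hypothesis.strip().lower() in premise.lower()


example = enriched_f1(
    predictions=[
        "Paris is capital of France",
        "Berlin is capital of Germany",
        "Rome is capital of Spain",
    ],
    gold=["paris is capital of france"],
    source_text="Paris is capital of France. Berlin is capital of Germany.",
    match_score=toy_match_score,
    entails=toy_entails,
)
```

In this toy run the second prediction is absent from the gold labels but supported by the source text, so it is credited rather than counted as an error, while the third prediction remains wrong. Passing the matcher and the entailment check as callables keeps the aggregation logic independent of whichever models are plugged in.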