This paper describes the results of SemEval 2023 Task 7, Multi-Evidence Natural Language Inference for Clinical Trial Data (NLI4CT), which consists of two tasks on clinical trial data: a Natural Language Inference (NLI) task and an evidence selection task. The proposed challenges require multi-hop biomedical and numerical reasoning, which are of significant importance to the development of systems capable of large-scale interpretation and retrieval of medical evidence to provide personalized evidence-based care. Task 1, the entailment task, received 643 submissions from 40 participants, and Task 2, the evidence selection task, received 364 submissions from 23 participants. The tasks are challenging: the majority of submitted systems fail to significantly outperform the majority class baseline on the entailment task, and performance on the evidence selection task is significantly better than on the entailment task. Increasing the number of model parameters leads to a direct increase in performance, far more significant than the effect of biomedical pre-training. Future work could explore the limitations of large models for generalization and numerical inference, and investigate methods to augment clinical datasets to allow for more rigorous testing and to facilitate fine-tuning. We envisage that the dataset, models, and results of this task will be useful to the biomedical NLI and evidence retrieval communities. The dataset, competition leaderboard, and website are publicly available.