This paper introduces the Pandemic PACT Advanced Categorisation Engine (PPACE) along with its associated dataset. PPACE is a fine-tuned model developed to automatically classify research abstracts from funded biomedical projects according to WHO-aligned research priorities. This task is crucial for monitoring research trends and identifying gaps in global health preparedness and response. Our approach builds on human-annotated projects, which are allocated one or more categories from a predefined list. A large language model is then used to generate 'rationales' explaining the reasoning behind these annotations. This augmented data, comprising expert annotations and rationales, is subsequently used to fine-tune a smaller, more efficient model. Developed as part of the Pandemic PACT project, which aims to track and analyse research funding and clinical evidence for a wide range of diseases with outbreak potential, PPACE supports informed decision-making by research funders, policymakers, and independent researchers. We introduce and release both the trained model and the instruction-based dataset used for its training. Our evaluation shows that PPACE significantly outperforms its baselines. The release of PPACE and its associated dataset offers valuable resources for researchers in multilabel biomedical document classification and supports advancements in aligning biomedical research with key global health priorities.
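The data-augmentation step described in the abstract (expert labels plus a model-generated rationale, packaged as an instruction-style training example) might look roughly like the sketch below. This is only an illustration under stated assumptions: the category names are placeholders, and the `build_record` helper, the prompt wording, and the `instruction`/`input`/`output` field names are hypothetical, not the format of the released PPACE dataset.

```python
import json

# Placeholder label set (assumption): the abstract does not list the real categories.
CATEGORIES = ["Category A", "Category B", "Category C"]


def build_record(abstract, labels, rationale):
    """Package one annotated abstract as an instruction-style training example.

    `labels` are the expert-assigned categories (one or more), and `rationale`
    is the explanation text produced separately by a large language model.
    """
    if not labels:
        raise ValueError("each project needs at least one category")
    unknown = [label for label in labels if label not in CATEGORIES]
    if unknown:
        raise ValueError(f"labels not in the predefined list: {unknown}")

    # Hypothetical prompt wording; the actual instruction text is not given in the abstract.
    instruction = (
        "Assign one or more of the following categories to the abstract "
        "and explain the reasoning.\nCategories: " + "; ".join(CATEGORIES)
    )
    # The target pairs the rationale with the labels, so the smaller model
    # learns to produce both during fine-tuning.
    output = json.dumps({"rationale": rationale, "categories": sorted(labels)})
    return {"instruction": instruction, "input": abstract.strip(), "output": output}


record = build_record(
    "  Example abstract text.  ",
    ["Category B", "Category A"],
    "Example rationale text.",
)
print(json.dumps(record, indent=2))
```

Placing the rationale in the target alongside the labels is one plausible design; the abstract states only that annotations and rationales together form the fine-tuning data, not how they are serialised.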