Despite advances in information extraction driven by deep learning and large language models, performance gaps remain in highly specialized biomedical fields, where domain-specific complexity poses challenges for generalist models. In this work, we focus on the domain of autoimmunity, where the main entities of interest are autoimmune diseases, autoantibodies (i.e., molecules that may mark or cause these diseases), their molecular targets, their location in the body, and their associated clinical signs. We present AAbAAC (AutoAntibodies and Autoimmunity Annotated Corpus), a corpus of 115 abstracts selected from PubMed in which we manually annotated entities and their relationships. We used AAbAAC first to evaluate several methods on the task of named entity recognition (NER), and then to fine-tune NER models. Our study demonstrates the utility of AAbAAC for information extraction in the domain of autoimmunity: as expected, NER performance improved after fine-tuning. This illustrates the value of small-scale annotation efforts for specialized domains and contributes to the computational study of autoimmunity. The AAbAAC corpus is available at https://github.com/f-maury/AAbAAC.