We present a novel approach to automatically identifying disease risk factors in medical literature, using pre-trained biomedical models fine-tuned for the task. To cope with the diverse and unstructured nature of medical articles, we introduce a multi-step system that first identifies relevant articles, then classifies them by whether they discuss risk factors, and finally extracts the risk factors for a given disease with a question-answering model. Our contributions are a comprehensive pipeline for automated risk factor extraction and several datasets that can serve as resources for further research in this area. The datasets cover a wide range of diseases and their associated risk factors, identified and validated through a fine-grained evaluation scheme. We conducted both automatic and thorough manual evaluations, with encouraging results. We also highlight the importance of improving the models and expanding the datasets' coverage to keep pace with the rapidly evolving field of medical research.
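The three stages described above (relevance filtering, risk factor classification, question-answering extraction) might be wired together roughly as in the following sketch. It is only an illustration of the control flow, not the authors' implementation: the names `Article`, `RiskFactorResult`, and `extract_risk_factors`, the question template, and the keyword-based stand-ins are all hypothetical, and in a real system each stand-in would be replaced by a fine-tuned biomedical model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Article:
    title: str
    text: str


@dataclass
class RiskFactorResult:
    title: str
    answer: str


def extract_risk_factors(
    articles: Iterable[Article],
    disease: str,
    is_relevant: Callable[[Article, str], bool],
    discusses_risk_factors: Callable[[Article], bool],
    answer_question: Callable[[str, str], str],
) -> List[RiskFactorResult]:
    """Run the three stages in order; each stage is a pluggable callable."""
    # Assumed question template; the source does not specify the wording.
    question = f"What are the risk factors for {disease}?"
    results: List[RiskFactorResult] = []
    for article in articles:
        # Stage 1: keep only articles relevant to the disease.
        if not is_relevant(article, disease):
            continue
        # Stage 2: keep only articles that discuss risk factors.
        if not discusses_risk_factors(article):
            continue
        # Stage 3: extract the risk factor text with a QA component.
        answer = answer_question(question, article.text)
        if answer:
            results.append(RiskFactorResult(article.title, answer))
    return results


# Toy stand-ins (hypothetical) so the sketch runs without any model.
def keyword_relevance(article: Article, disease: str) -> bool:
    return disease.lower() in article.text.lower()


def keyword_classifier(article: Article) -> bool:
    return "risk factor" in article.text.lower()


def toy_qa(question: str, context: str) -> str:
    # Return the first sentence that mentions "risk factor", if any.
    for sentence in context.split("."):
        if "risk factor" in sentence.lower():
            return sentence.strip()
    return ""
```

Passing the stages as callables keeps the pipeline logic independent of any particular model, so each stage can be evaluated or swapped on its own.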