Objective: Semantic indexing of biomedical literature is usually done at the level of MeSH descriptors, where several related but distinct biomedical concepts are often grouped together and treated as a single topic. This study proposes a new method for the automated refinement of subject annotations at the level of MeSH concepts. Methods: Lacking labelled data, we rely on weak supervision based on concept occurrence in the abstract of an article, enhanced by dictionary-based heuristics. In addition, we investigate deep learning approaches, making design choices to tackle the particular challenges of this task. The new method is evaluated in a large-scale retrospective scenario, based on concepts that have been promoted to descriptors. Results: In our experiments, concept occurrence was the strongest heuristic, achieving a macro-F1 score of about 0.63 across several labels. The proposed method improved on it by more than 4 percentage points. Conclusion: The results suggest that concept occurrence is a strong heuristic for refining coarse-grained labels at the level of MeSH concepts, and that the proposed method improves on it further.
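To make the concept-occurrence heuristic and the macro-F1 metric concrete, here is a minimal sketch. It assumes plain case-insensitive, whole-word string matching of concept terms against the abstract text; the study's actual matching procedure and dictionary-based enhancements are not specified here. The names `weak_labels`, `macro_f1`, and `concept_terms` are hypothetical and chosen only for illustration.

```python
import re


def weak_labels(abstract, concept_terms):
    """Return the set of concepts with at least one term occurring in the abstract.

    concept_terms maps a (hypothetical) concept name to a list of surface terms.
    """
    text = abstract.lower()
    labels = set()
    for concept, terms in concept_terms.items():
        for term in terms:
            # Whole-word, case-insensitive match: an assumption of this sketch.
            pattern = r"\b" + re.escape(term.lower()) + r"\b"
            if re.search(pattern, text):
                labels.add(concept)
                break  # one matching term is enough for this concept
    return labels


def macro_f1(gold, predicted, labels):
    """Unweighted mean of per-label F1 scores over the given labels.

    gold and predicted are parallel lists of label sets, one per article.
    """
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if label in g and label in p)
        fp = sum(1 for g, p in pairs if label not in g and label in p)
        fn = sum(1 for g, p in pairs if label in g and label not in p)
        denom = 2 * tp + fp + fn
        # A label with no gold and no predicted instances scores 0.0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

Because macro-F1 averages per-label scores without weighting by frequency, rare concepts count as much as frequent ones, which is why it suits a setting with several fine-grained labels of uneven size.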