Biomedical Natural Language Processing (NLP) is often cumbersome for researchers, largely because of the volume and heterogeneity of the text to be processed. To address this challenge, industry continues to develop highly efficient tools and more flexible engineering solutions. This work presents the integration of industry data engineering solutions for efficient data processing with academic systems developed for Named Entity Recognition (LasigeUnicage\_NER) and Relation Extraction (BiOnt). Our design combines these components with external knowledge in the form of additional training data from other datasets and biomedical ontologies. We used this pipeline in the 2022 LitCoin NLP Challenge, where our team, LasigeUnicage, was awarded the 7th Prize out of approximately 200 participating teams, reflecting a successful collaboration between academia (LASIGE) and industry (Unicage). The software supporting this work is available at \url{https://github.com/lasigeBioTM/Litcoin-Lasige_Unicage}.