With the proliferation of research methods and computational methodologies, the published biomedical literature is growing exponentially in both number and volume. As a consequence, domain experts in biological, medical and clinical research have to sift through massive amounts of scientific text to find relevant information. This process is too tedious and slow to be performed by humans alone. Hence, novel computational mechanisms for information extraction and correlation are required to support meaningful knowledge extraction. In this work, we present the design, implementation and application of a novel data extraction and exploration system. The system extracts deep semantic relations between textual entities in the scientific literature to enrich existing structured clinical data in the domain of cancer cell lines. We introduce a new public data exploration portal, which automatically links plots of genomic copy number variants with ranked, related entities such as affected genes. Each relation is accompanied by literature-derived evidence, allowing for deep, yet rapid, literature search that uses existing structured data as a springboard. Our system is publicly available on the web at https://cancercelllines.org.