With the proliferation of research methods and computational methodologies, the published biomedical literature is growing exponentially in both the number of publications and their volume. Cancer cell lines are frequently used models in biological and medical research, applied for a wide range of purposes from studies of cellular mechanisms to drug development, and this has produced a wealth of related data and publications. Sifting through large quantities of text to gather relevant information on the cell lines of interest is tedious and extremely slow when performed by humans. Hence, novel computational information extraction and correlation mechanisms are required to support meaningful knowledge extraction. In this work, we present the design, implementation and application of a novel data extraction and exploration system. This system extracts deep semantic relations between textual entities from scientific literature to enrich existing structured clinical data in the domain of cancer cell lines. We introduce a new public data exploration portal, which automatically links genomic copy number variant plots with ranked, related entities such as affected genes. Each relation is accompanied by literature-derived evidence, allowing for deep, yet rapid, literature search using existing structured data as a springboard. Our system is publicly available on the web at https://cancercelllines.org