There is an ongoing need for scalable tools that help researchers standardize, both retrospectively and prospectively, the discrete entity types (such as disease names, cell types, or chemicals) used in metadata associated with biomedical data. When metadata are not well structured or precise, the associated data are harder to find and often burdensome to reuse, analyze, or integrate with other datasets, because substantial upfront curation, typically retrospective standardization and cleaning of the (meta)data, is required to make them usable. To facilitate metadata standardization, whether in bulk or one term at a time (for example, to support auto-completion of biomedical entities in forms), we have developed an open-source tool called text2term that maps free-text descriptions of biomedical entities to controlled terms in ontologies. The tool is highly configurable and can be used in several ways that cater to different users and levels of expertise: as a Python package available on PyPI; through a command-line interface; through our hosted Web application with a graphical user interface (https://text2term.hms.harvard.edu); or by deploying a local instance of the interactive application using Docker.
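To make the mapping task concrete, the following is a minimal, self-contained sketch of matching a free-text description to the closest label in a small controlled vocabulary. It is not text2term's implementation or API: the vocabulary `TOY_ONTOLOGY`, its `EX:` identifiers, the function `map_term`, and the `min_score` threshold are all hypothetical, and the character-level similarity from Python's standard `difflib` module stands in for whatever matching strategy a real mapper would use.

```python
from difflib import SequenceMatcher

# Hypothetical toy vocabulary: label -> made-up identifier.
# A real ontology would supply labels, synonyms, and real term IRIs.
TOY_ONTOLOGY = {
    "asthma": "EX:0001",
    "type 2 diabetes mellitus": "EX:0002",
    "breast carcinoma": "EX:0003",
}


def normalize(text):
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def map_term(source_term, ontology=TOY_ONTOLOGY, min_score=0.6):
    """Return (label, term_id, score) for the best-matching label,
    or None if no label reaches min_score.

    The score is difflib's similarity ratio in [0, 1]; this is an
    illustrative stand-in, not the method used by text2term.
    """
    query = normalize(source_term)
    best = None
    for label, term_id in ontology.items():
        score = SequenceMatcher(None, query, normalize(label)).ratio()
        if best is None or score > best[2]:
            best = (label, term_id, score)
    if best is not None and best[2] >= min_score:
        return best
    return None


if __name__ == "__main__":
    for raw in ["  Asthma ", "type 2 diabetes", "zzzz"]:
        print(repr(raw), "->", map_term(raw))
```

The threshold illustrates the usual trade-off in this kind of mapping: a lower value returns more candidate mappings for a curator to review, while a higher value returns fewer but more reliable ones.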