A food composition knowledge base, which stores the essential phyto-, micro-, and macro-nutrients of foods, is useful for both research and industrial applications. Although many existing knowledge bases attempt to curate such information, they are often limited by time-consuming manual curation processes. Outside of the food science domain, natural language processing methods that utilize pre-trained language models have recently shown promising results for extracting knowledge from unstructured text. In this work, we propose a semi-automated framework for constructing a knowledge base of food composition from the scientific literature available online. To this end, we utilize a pre-trained BioBERT language model in an active learning setup that allows optimal use of limited training data. Our work demonstrates how human-in-the-loop models are a step toward AI-assisted food systems that scale well to ever-increasing volumes of data.