A crucial component in the curation of a knowledge base (KB) for a scientific domain is information extraction from tables in the domain's published articles: tables carry important information (often numeric), which must be adequately extracted for a comprehensive machine understanding of an article. Existing table extractors assume prior knowledge of table structure and format, which may not be available for scientific tables. We study a specific and challenging table extraction problem: extracting compositions of materials (e.g., glasses, alloys). We first observe that materials science researchers organize similar compositions in a wide variety of table styles, necessitating an intelligent model for table understanding and composition extraction. Consequently, we define this novel task as a challenge for the ML community and create a training dataset comprising 4,408 distantly supervised tables, along with 1,475 manually annotated dev and test tables. We also present DiSCoMaT, a strong baseline geared towards this specific task, which combines multiple graph neural networks with several task-specific regular expressions, features, and constraints. We show that DiSCoMaT outperforms recent table processing architectures by significant margins.