Quantities are essential in documents for describing factual information. They are ubiquitous in application domains such as finance, business, medicine, and science in general. Compared with other information extraction tasks, however, only a few works describe methods for properly extracting and representing quantities in text. In this paper, we present a comprehensive framework for extracting quantities from text data. It efficiently detects combinations of values and units, the behavior of a quantity (e.g., rising or falling), and the concept a quantity is associated with. Our framework makes use of dependency parsing and a dictionary of units, and it normalizes and standardizes the detected quantities. Using a novel dataset for evaluation, we show that our open-source framework outperforms other systems and, to the best of our knowledge, is the first to detect concepts associated with identified quantities. The code and data underlying our framework are available at https://github.com/vivkaz/CQE.
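To make the idea of a normalized quantity concrete, the following is a minimal Python sketch and not the framework's actual implementation. It swaps the dependency parsing described above for a plain regular expression, and it omits concept detection entirely. The names `UNIT_DICTIONARY`, `CHANGE_CUES`, `Quantity`, and `extract_quantities` are hypothetical and chosen only for illustration; the unit entries and cue words are likewise assumptions.

```python
import re
from dataclasses import dataclass

# Hypothetical unit dictionary (an assumption, not the framework's own):
# surface form -> (normalized unit name, factor converting to that unit).
UNIT_DICTIONARY = {
    "km": ("metre", 1000.0),
    "kilometers": ("metre", 1000.0),
    "m": ("metre", 1.0),
    "kg": ("kilogram", 1.0),
    "%": ("percentage", 1.0),
    "percent": ("percentage", 1.0),
}

# Hypothetical cue words that signal the behavior of a quantity.
CHANGE_CUES = {"rose": "up", "increased": "up", "fell": "down", "dropped": "down"}


@dataclass
class Quantity:
    value: float  # value after conversion to the normalized unit
    unit: str     # normalized unit name
    change: str   # "up", "down", or "=" when no cue word is found


# A number followed by either "%" or an alphabetic token (the candidate unit).
_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(%|[A-Za-z]+)")


def extract_quantities(text):
    """Return normalized quantities found in a single sentence."""
    # The first cue word in the sentence sets the behavior for all quantities.
    change = "="
    for token in re.findall(r"[A-Za-z]+", text.lower()):
        if token in CHANGE_CUES:
            change = CHANGE_CUES[token]
            break

    results = []
    for match in _PATTERN.finditer(text):
        surface = match.group(2).lower()
        if surface not in UNIT_DICTIONARY:
            continue  # the token after the number is not a known unit
        unit, factor = UNIT_DICTIONARY[surface]
        results.append(Quantity(float(match.group(1)) * factor, unit, change))
    return results
```

Calling `extract_quantities("The share price rose by 5 percent.")` under these assumptions yields a single `Quantity` with value `5.0`, unit `"percentage"`, and change `"up"`. A regular expression cannot tell which noun a quantity refers to, which is one reason a dependency parse is the more suitable basis for associating quantities with concepts.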