Data comes in many forms. At a coarse level, it can be viewed as either structured (e.g., a relation or key-value pairs) or unstructured (e.g., text or images). So far, machines have been fairly good at processing and reasoning over structured data that follows a precise schema. However, the heterogeneity of data makes it difficult to store and process diverse categories of data in a meaningful way. Data integration, a crucial part of the data engineering pipeline, addresses this challenge by combining disparate data sources and providing end users with unified access to the data. Until now, most data integration systems have focused on combining structured data sources only. Yet unstructured data, free text in particular, also contains a wealth of knowledge waiting to be utilized. In this chapter, we therefore first make the case for integrating textual data and then present its challenges, the state of the art, and open problems.