In recent years, data lakes have emerged as a way to manage large amounts of heterogeneous data for modern data analytics. One way to prevent data lakes from turning into inoperable data swamps is semantic data management. Some approaches propose linking metadata to knowledge graphs based on the Linked Data principles to give the data in the lake more meaning and semantics. Such a semantic layer may be used not only for data management but also to tackle the problem of integrating data from heterogeneous sources, making data access more expressive and interoperable. In this survey, we review recent approaches with a specific focus on their application within data lake systems and their scalability to Big Data. We classify the approaches into (i) basic semantic data management, (ii) semantic modeling approaches for enriching metadata in data lakes, and (iii) methods for ontology-based data access. In each category, we cover the main techniques and their background, and compare the latest research. Finally, we point out challenges for future work in this research area, which needs a closer integration of Big Data and Semantic Web technologies.