Modern applications commonly need to manage datasets composed of heterogeneous data and schemas, which makes it difficult to access them in an integrated way. A single data store that manages heterogeneous data under a common data model is not effective in such a scenario, so domain data ends up fragmented across the data stores that best fit its storage and access requirements (e.g., NoSQL, relational DBMS, or HDFS). Moreover, organizational workflows consume these fragments independently, and usually there is no explicit link among the fragments that could support an integrated view. The research challenge tackled by this work is to provide the means to query heterogeneous data residing in distinct data repositories that are not explicitly connected. We propose a federated database architecture that gives users a single abstract global conceptual schema against which they write their queries, encapsulating data heterogeneity, location, and linkage by employing: (i) meta-models to represent the global conceptual schema, the local conceptual schemas of the remote data, and the mappings among them; and (ii) provenance to create explicit links among the consumed and generated data residing in separate datasets. We evaluated the architecture through its implementation as a polystore service, following a microservice architecture approach, in a scenario that simulates a real case in the Oil & Gas industry. We also compared the proposed architecture to a relational multidatabase system based on foreign data wrappers, measuring the user's cognitive load to write a query (or query complexity) and the query processing time. The results show that queries written for the proposed architecture are half as complex as those written for the relational multidatabase system, at a cost of no more than 30% additional query processing time.