Imperfect databases are common in many applications, for reasons ranging from data-entry errors, transmission or integration errors, and incorrect instrument readings to faulty experimental setups that produce incorrect results. Managing and querying imperfect databases is a challenging problem because it requires incorporating the data's qualities within the database engine. Even more challenging, these qualities are typically not static and may evolve over time. Unfortunately, most state-of-the-art techniques treat data quality as an offline task carried out outside the DBMS, in total isolation from the query processing engine. Hence, end-users receive query results with no indication of whether the results can be trusted for further analysis and decision making. In this paper, we propose the "QTrail-DB" system, which fundamentally extends standard DBMSs to support imperfect databases with evolving qualities. QTrail-DB introduces a new quality model based on the new concept of "Quality Trails", which captures the evolution of the data's qualities over time. QTrail-DB extends the relational data model to incorporate the quality trails within the database system. We propose a new query algebra, called "QTrail Algebra", that enables seamless and transparent propagation and derivation of the data's qualities within a query pipeline. As a result, a query's answer is automatically annotated with quality-related information at the tuple level. The QTrail-DB propagation model leverages the thoroughly studied propagation semantics from the database provenance and lineage tracking literature, and thus there is no need to develop a new query optimizer. QTrail-DB is developed within PostgreSQL and experimentally evaluated using real-world datasets to demonstrate its efficiency and practicality.
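To make the idea of tuple-level quality annotation more concrete, the following is a minimal illustrative sketch, not the QTrail-DB implementation. The abstract does not specify the structure of a quality trail or the operators of the QTrail Algebra, so everything here is an assumption: the names `QualityEvent`, `AnnotatedTuple`, `select`, and `join`, the use of a numeric score with an integer timestamp, and the choice to merge the trails of both inputs on a join (in the spirit of provenance-style propagation).

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass(frozen=True)
class QualityEvent:
    """One entry in a hypothetical quality trail (assumed structure)."""
    timestamp: int      # assumed logical time of the quality assessment
    score: float        # assumed quality score, e.g. in [0, 1]
    note: str = ""      # assumed free-text reason for the change


@dataclass
class AnnotatedTuple:
    """A tuple carrying its own quality trail (assumed representation)."""
    values: Dict[str, Any]
    trail: List[QualityEvent] = field(default_factory=list)

    def current_quality(self) -> Optional[float]:
        # The most recent event is taken as the current quality.
        if not self.trail:
            return None
        return max(self.trail, key=lambda e: e.timestamp).score


def select(rows: List[AnnotatedTuple],
           predicate: Callable[[Dict[str, Any]], bool]) -> List[AnnotatedTuple]:
    # Selection keeps each qualifying tuple together with its unchanged trail.
    return [r for r in rows if predicate(r.values)]


def join(left: List[AnnotatedTuple], right: List[AnnotatedTuple],
         key: str) -> List[AnnotatedTuple]:
    # Assumed propagation rule: an output tuple inherits the events of both
    # contributing input tuples, ordered by timestamp.
    out = []
    for l in left:
        for r in right:
            if l.values[key] == r.values[key]:
                merged = sorted(l.trail + r.trail, key=lambda e: e.timestamp)
                out.append(AnnotatedTuple({**l.values, **r.values}, merged))
    return out
```

Under these assumptions, every tuple in a query answer arrives with the history of quality assessments of the tuples it was derived from, which is the kind of information an end-user would need to judge whether a result can be trusted.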