Analysts often struggle with analyzing data from multiple tables in a database due to their lack of knowledge on how to join and aggregate the data. To address this, data engineers pre-specify "semantic layers" which include the join conditions and "metrics" of interest with aggregation functions and expressions. However, joins can cause "aggregation consistency issues". For example, analysts may observe inflated total revenue caused by double counting from join fanouts. Existing BI tools rely on heuristics for deduplication, resulting in imprecise and challenging-to-understand outcomes. To overcome these challenges, we propose "weighing" as a core primitive to counteract join fanouts. "Weighing" has been used in various areas, such as market attribution and order management, ensuring metrics consistency (e.g., total revenue remains the same) even for many-to-many joins. The idea is to assign equal weight to each join key group (rather than each tuple) and then distribute the weights among tuples. Implementing weighing techniques necessitates user input; therefore, we recommend a human-in-the-loop framework that enables users to iteratively explore different strategies and visualize the results.
翻译:分析人员常因缺乏对数据库中多表连接与聚合操作的知识,难以对多表数据进行分析。为解决此问题,数据工程师预先定义"语义层",其中包含连接条件以及带有聚合函数与表达式的待分析"指标"。然而,连接操作可能引发"聚合一致性问题"。例如,分析人员可能观察到因连接扇出导致重复计数而造成的总收入虚高。现有商业智能(BI)工具依赖启发式方法进行去重,导致结果不精确且难以理解。为克服这些挑战,我们提出将"加权"作为应对连接扇出的核心原语。"加权"已被广泛应用于市场归因与订单管理等场景,即使在多对多连接情况下也能确保指标一致性(例如总收入保持不变)。其核心思想是对每个连接键组(而非每个元组)赋予相等权重,再将权重在各元组间分配。加权技术的实现需要用户输入;因此,我们推荐采用人机协同框架,使用户能够迭代探索不同策略并可视化结果。