Business Intelligence (BI) is crucial in modern enterprises and is itself a billion-dollar business. Traditionally, technical experts such as database administrators would manually prepare BI models (e.g., in star or snowflake schemas) that join tables in data warehouses, before less-technical business users could run analytics using end-user dashboarding tools. However, the popularity of self-service BI tools (e.g., Tableau and Power-BI) in recent years has created a strong demand for less-technical end-users to build BI models themselves. We develop an Auto-BI system that accurately predicts BI models given a set of input tables. It is built on a principled graph-based optimization problem that we propose, called \textit{k-Min-Cost-Arborescence} (k-MCA), which holistically considers both local join prediction and global schema-graph structure by leveraging a graph-theoretical structure called an \textit{arborescence}. While we prove that k-MCA is intractable and inapproximable in general, we develop novel algorithms that solve k-MCA optimally; these are shown to be efficient in practice, with sub-second latency, and to scale to the largest BI models we encounter (with close to 100 tables). Auto-BI is rigorously evaluated on a unique dataset of over 100K real BI models that we harvested, as well as on 4 popular TPC benchmarks. It is shown to be both efficient and accurate, achieving an F1-score of over 0.9 on both real and synthetic benchmarks.
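To make the k-MCA idea concrete, the following is a minimal brute-force sketch, not the algorithm developed in this work. It assumes that each candidate join is a directed edge (parent table to child table) with a predicted probability, that a selected edge costs the negative log of that probability, and that each arborescence root incurs a fixed penalty. The function names, the example tables, the probabilities, and the penalty value are all hypothetical, and the enumeration is exponential, so it is only suitable for a handful of tables.

```python
import itertools
import math


def has_cycle(parent):
    """Return True if following parent pointers from any table loops back."""
    for start in parent:
        seen = set()
        node = start
        while node is not None:
            if node in seen:
                return True
            seen.add(node)
            node = parent[node]
    return False


def k_mca_bruteforce(tables, join_probs, root_penalty):
    """Enumerate every parent assignment and keep the cheapest acyclic one.

    tables:       list of table names
    join_probs:   dict mapping (parent, child) -> predicted join probability
                  (assumed encoding; every name must appear in `tables`)
    root_penalty: assumed fixed cost charged for each arborescence root
    """
    # Each table either has no parent (it is a root) or exactly one parent,
    # which is what makes the result a set of arborescences.
    options = []
    for t in tables:
        options.append([None] + [src for (src, dst) in join_probs if dst == t])

    best_cost, best_edges = math.inf, None
    for choice in itertools.product(*options):
        parent = dict(zip(tables, choice))
        if has_cycle(parent):
            continue
        cost = 0.0
        for child, par in parent.items():
            if par is None:
                cost += root_penalty
            else:
                cost += -math.log(join_probs[(par, child)])
        if cost < best_cost:
            best_cost = cost
            best_edges = sorted((p, c) for c, p in parent.items() if p is not None)
    return best_cost, best_edges


# Hypothetical example: a small snowflake plus one weakly related table.
tables = ["sales", "customer", "product", "region", "budget"]
join_probs = {
    ("sales", "customer"): 0.9,
    ("sales", "product"): 0.8,
    ("customer", "region"): 0.7,
    ("product", "customer"): 0.3,  # spurious candidate
    ("region", "sales"): 0.2,      # spurious back-edge that would close a cycle
    ("sales", "budget"): 0.05,     # low-confidence candidate
}
cost, edges = k_mca_bruteforce(tables, join_probs, root_penalty=1.0)
print(edges)
```

In this example the three high-probability joins are selected, while `budget` is left as its own root because the cost of its only candidate join (about 3.0) exceeds the assumed root penalty of 1.0. This illustrates how a global objective can reject locally plausible but structurally inconsistent joins.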