Many data-insight questions can be viewed as searching a large space of tables for the important ones, where the notion of importance is defined in an ad hoc, user-defined manner. This paper presents Holistic Cube Analysis (HoCA), a framework that augments the capabilities of relational queries for such problems. HoCA first extends the relational data model with a new data type, AbstractCube, defined as a function that maps a region-features pair to a relational table (a region is a tuple that specifies values for a set of dimensions). AbstractCube provides a logical form of data, and HoCA operators are cube-to-cube transformations. We describe two basic but fundamental HoCA operators, cube crawling and cube join, both of which admit many possible extensions. Cube crawling explores a region space and outputs a cube that maps regions to signal vectors. Cube join, in turn, is critical for composition: it allows one to join information from different cubes for deeper analysis. Cube crawling introduces two novel programming features: (programmable) Region Analysis Models (RAMs) and Multi-Model Crawling. Crucially, a RAM has a notion of population features, which allows one to go beyond analyzing only the local features of a region and to program region-population analyses that compare region features with population features, capturing a large class of importance notions. HoCA opens a rich algorithmic space, including the optimization of crawling and join performance and the physical design of cubes. We have implemented and deployed HoCA at Google. Our early HoCA offering has attracted more than 30 teams building applications with it, across a diverse spectrum of fields including system monitoring, experimentation analysis, and business intelligence. For many applications, HoCA enables novel and powerful analyses, such as instances of recurrent crawling, that are challenging to achieve otherwise.
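To make the vocabulary above concrete, the following is a minimal Python sketch of the ideas named in the abstract: an AbstractCube as a function from a region-features pair to a table, a RAM that compares region features with population features, a crawl over a region space, and a join of two cubes. It is not HoCA's actual API. Every name here (`ROWS`, `abstract_cube`, `mean_ratio_ram`, `count_ram`, `crawl`, `cube_join`), the toy data, and the choice of the whole table as the "population" are assumptions made purely for illustration.

```python
from itertools import combinations

# Toy fact table; the columns and values are invented for this sketch.
ROWS = [
    {"country": "US", "device": "mobile", "latency": 120.0},
    {"country": "US", "device": "desktop", "latency": 80.0},
    {"country": "DE", "device": "mobile", "latency": 200.0},
    {"country": "DE", "device": "desktop", "latency": 100.0},
]


def abstract_cube(region, features):
    """Map a (region, features) pair to a relational table.

    A region is a tuple of (dimension, value) pairs; the empty region ()
    selects every row. The table is returned as a list of row dicts.
    """
    return [
        {f: row[f] for f in features}
        for row in ROWS
        if all(row[dim] == val for dim, val in region)
    ]


def mean_ratio_ram(region_table, population_table, feature):
    """Hypothetical RAM: compare the region mean with the population mean."""
    def mean(table):
        return sum(r[feature] for r in table) / len(table)

    region_mean = mean(region_table)
    return {
        "mean": region_mean,
        "ratio_to_population": region_mean / mean(population_table),
    }


def count_ram(region_table, population_table, feature):
    """Hypothetical RAM: region size and its share of the population."""
    return {
        "rows": len(region_table),
        "share": len(region_table) / len(population_table),
    }


def crawl(dimensions, feature, ram):
    """Visit every region over every non-empty subset of `dimensions` and
    return a cube (here, a dict) mapping each region to a signal vector."""
    # Assumption: the population is the whole table.
    population = abstract_cube((), [feature])
    signals = {}
    for k in range(1, len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            # Only value combinations that actually occur in the data.
            seen = sorted({tuple(row[d] for d in dims) for row in ROWS})
            for vals in seen:
                region = tuple(zip(dims, vals))
                table = abstract_cube(region, [feature])
                signals[region] = ram(table, population, feature)
    return signals


def cube_join(left, right):
    """Join two cubes on region, merging their signal vectors."""
    return {
        region: {**left[region], **right[region]}
        for region in left.keys() & right.keys()
    }


latency_cube = crawl(["country", "device"], "latency", mean_ratio_ram)
size_cube = crawl(["country", "device"], "latency", count_ram)
joined = cube_join(latency_cube, size_cube)
```

In this sketch, `joined[(("country", "US"),)]` carries both the latency signals and the size signals for the region `country = US`, which is the kind of composition the abstract attributes to cube join. Running two RAMs in separate crawls is only a stand-in; how Multi-Model Crawling actually schedules several models is not specified in the abstract.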