Many operational systems collect high-dimensional time-series data about users/systems on key performance metrics. For instance, ISPs, content distribution networks, and video delivery services collect quality-of-experience metrics for user sessions associated with metadata (e.g., location, device, ISP). Over such historical data, operators and data analysts often need to run retrospective analysis; e.g., analyze anomaly detection algorithms, experiment with different alert configurations, evaluate new algorithms, and so on. We refer to this class of workloads as alternative history analysis for operational datasets. We show that in such settings, traditional data processing solutions (e.g., data warehouses, sampling, sketching, big-data systems) either incur high operational costs or do not guarantee accurate replay. We design and implement a system, called AHA (Alternative History Analytics), that overcomes both challenges to provide cost efficiency and fidelity for high-dimensional data. The design of AHA is based on analytical and empirical insights about such workloads: 1) the decomposability of the underlying statistics; 2) the sparsity of active subpopulations over attribute-value combinations; and 3) the efficiency structure of aggregation operations in modern analytics databases. Using multiple real-world datasets as well as case studies on production pipelines at a large video analytics company, we show that AHA provides 100% accuracy for a broad range of downstream tasks and up to 85x lower total cost of ownership (i.e., compute + storage) compared to conventional methods.
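To make the first insight concrete, the sketch below illustrates what "decomposability of the underlying statistics" can mean in practice: additive statistics such as sums and counts, pre-aggregated per time bucket and per attribute-value combination, can be exactly recombined at replay time for an arbitrary attribute filter. This is a minimal, hypothetical Python illustration under assumed record and key layouts, not AHA's actual design or implementation.

```python
# Minimal sketch of decomposable statistics for accurate replay.
# The record format, attribute names, and aggregation key are assumptions
# for illustration only, not AHA's implementation.
from collections import defaultdict

# Hypothetical raw session records: (minute, city, device, isp, rebuffer_ratio)
raw_sessions = [
    (0, "NYC", "ios",     "ISP-A", 0.02),
    (0, "NYC", "android", "ISP-A", 0.10),
    (0, "SF",  "ios",     "ISP-B", 0.00),
    (1, "NYC", "ios",     "ISP-A", 0.05),
]

# Ingest-time pre-aggregation: one [sum, count] pair per (minute, attrs) key.
agg = defaultdict(lambda: [0.0, 0])
for minute, city, device, isp, metric in raw_sessions:
    key = (minute, city, device, isp)
    agg[key][0] += metric
    agg[key][1] += 1

def replay_mean(minute, **filters):
    """Mean metric for a time bucket under an arbitrary attribute filter,
    computed only from the decomposed (sum, count) aggregates."""
    total, n = 0.0, 0
    for (m, city, device, isp), (s, c) in agg.items():
        attrs = {"city": city, "device": device, "isp": isp}
        if m == minute and all(attrs[k] == v for k, v in filters.items()):
            total += s
            n += c
    return total / n if n else None

# The replayed mean matches the mean computed directly over the raw sessions
# for the same slice, i.e., the statistic decomposes without loss.
assert abs(replay_mean(0, city="NYC") - (0.02 + 0.10) / 2) < 1e-12
print(replay_mean(0, city="NYC"))  # 0.06
```

In such a scheme, only subpopulations that actually appear in the data materialize aggregate rows, which is where the sparsity of active attribute-value combinations noted in the abstract would keep storage bounded.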