Accelerating Fresh Data Exploration with Fluid ETL Pipelines

Recently, we have seen an increasing need for fresh data exploration, where data analysts seek to explore the main characteristics of, or detect anomalies in, data that is still being actively collected. In addition to the common challenges of classic data exploration, such as a lack of prior knowledge about the data or the analysis goal, fresh data exploration also demands an ingestion system with sufficient throughput to keep up with rapid data accumulation. However, using traditional Extract-Transform-Load (ETL) pipelines to achieve low query latency can be extremely resource-intensive, because they must run an excessive number of data preprocessing routines (DPRs), such as parsing and indexing, to cover unpredictable data characteristics and analysis goals. To overcome this challenge, we approach the problem from a different angle: leveraging occasional idle system capacity or cheap preemptible resources (e.g., Amazon EC2 Spot Instances) during ingestion. In particular, we introduce a new type of data ingestion system called fluid ETL pipelines, which allow users to start and stop arbitrary DPRs on demand without blocking data ingestion. With fluid ETL pipelines, users can start potentially useful DPRs to accelerate future exploration queries whenever idle or cheap resources are available. Moreover, when resources are limited, users can dynamically change which DPRs run to adapt to their evolving interests. We conducted experiments on a real-world dataset and verified that our vision is viable. The introduction of fluid ETL pipelines also raises new challenges in handling essential tasks such as ad-hoc query processing, DPR generation, and DPR management. In this paper, we discuss these open research challenges in detail and outline potential directions for addressing them.
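To make the idea of starting and stopping DPRs without blocking ingestion more concrete, the following is a minimal, single-process sketch. It is an illustration only and not the system described in this paper: the names `FluidPipeline`, `DPR`, `run_idle`, and `token_index` are hypothetical, and the cursor-per-DPR design is an assumption chosen for simplicity. Ingestion only appends to a raw log; each DPR keeps its own cursor into that log and advances only when it is active and spare capacity is granted; a query uses whatever a DPR has already built and falls back to scanning the unprocessed tail.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DPR:
    """A hypothetical data preprocessing routine with its own cursor into the raw log."""
    name: str
    apply: Callable[[str, int, dict], None]  # (record, position, state) -> None
    state: dict = field(default_factory=dict)
    cursor: int = 0        # number of raw records this DPR has processed so far
    active: bool = False   # toggled by start()/stop()


class FluidPipeline:
    """Illustrative sketch: ingestion never waits on any DPR."""

    def __init__(self) -> None:
        self.raw: List[str] = []
        self.dprs: Dict[str, DPR] = {}

    def ingest(self, record: str) -> None:
        # Ingestion is a plain append; no preprocessing happens on this path.
        self.raw.append(record)

    def register(self, dpr: DPR) -> None:
        self.dprs[dpr.name] = dpr

    def start(self, name: str) -> None:
        self.dprs[name].active = True

    def stop(self, name: str) -> None:
        # Partial results remain in dpr.state, so the DPR can resume later.
        self.dprs[name].active = False

    def run_idle(self, budget: int) -> None:
        """Spend up to `budget` record-processing steps on active DPRs.

        The budget is shared across DPRs in registration order (an assumption;
        a real scheduler would likely prioritize).
        """
        for dpr in self.dprs.values():
            if not dpr.active:
                continue
            while budget > 0 and dpr.cursor < len(self.raw):
                dpr.apply(self.raw[dpr.cursor], dpr.cursor, dpr.state)
                dpr.cursor += 1
                budget -= 1

    def search(self, token: str, index_name: str) -> List[int]:
        """Return positions of records containing `token`.

        Uses the index for the prefix the DPR has covered and scans the rest.
        """
        dpr = self.dprs[index_name]
        hits = list(dpr.state.get(token, []))
        for pos in range(dpr.cursor, len(self.raw)):
            if token in self.raw[pos].split():
                hits.append(pos)
        return hits


def token_index(record: str, pos: int, state: dict) -> None:
    """Hypothetical DPR body: build an inverted index from token to positions."""
    for tok in set(record.split()):
        state.setdefault(tok, []).append(pos)
```

In this sketch, a caller would register `DPR("tok", token_index)`, call `start("tok")` when spare capacity appears, invoke `run_idle(n)` with however many steps are affordable, and call `stop("tok")` when the capacity is reclaimed. Queries return the same answer regardless of how far the DPR has progressed; only the amount of raw scanning changes.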
