Data processing engines increasingly leverage distributed file systems for scalable, cost-effective storage. While the Apache Parquet columnar format has become a popular choice for data storage and retrieval, Parquet files are immutable, which makes the format impractical for the frequent updates found in contemporary analytical workloads. Log-Structured Tables (LSTs), such as Delta Lake, Apache Iceberg, and Apache Hudi, offer an alternative for scenarios requiring data mutability, balancing efficient updates against the benefits of columnar storage. They provide features like transactions, time travel, and schema evolution, which enhance usability and enable access from multiple engines. Moreover, engines like Apache Spark and Trino can be configured to leverage the optimizations and controls offered by LSTs to meet specific business needs. Conventional benchmarks and tools are inadequate for evaluating these changes in the storage layer, as they do not allow us to measure the impact of design and optimization choices in this new setting. In this paper, we propose a novel benchmarking approach and metrics that build upon existing benchmarks, aiming to systematically assess LSTs. We develop a framework, LST-Bench, which facilitates effective exploration and evaluation of how LSTs and data processing engines work together through tailored benchmark packages. A package is a mix of use patterns reflecting a target workload; LST-Bench makes it easy to define a wide range of use patterns and combine them into a package, and we include a baseline package for completeness. Our assessment demonstrates the effectiveness of our framework and benchmark packages in extracting valuable insights across diverse environments. The code for LST-Bench is open-sourced and is available at https://github.com/microsoft/lst-bench/.