Data lakes have emerged as a flexible and scalable solution for storing and analyzing large volumes of heterogeneous data, spanning structured, semi-structured, and unstructured formats. Despite their growing adoption in both industry and academia, there is no standardized, comprehensive benchmark for evaluating the performance of data lake systems. Existing benchmarks primarily target traditional data warehouses and focus on structured SQL workloads, making them insufficient for capturing the diverse workloads and access patterns typical of data lakes. In this work, we propose a new benchmarking framework for data lakes that aims to provide an objective, comparative evaluation of different data lake implementations. Our benchmark covers multiple data types and workload models, including data retrieval, aggregation, querying, and similarity search, an operation that is common in practice yet underexplored in existing benchmarks. We measure key performance metrics, such as query execution time, metadata generation time, and metadata size, across different scale factors. The benchmark is designed to be extensible and reproducible, enabling users to generate datasets and evaluate data lake systems under realistic and diverse scenarios. We conduct our experiments on CloudLab and demonstrate how the proposed benchmark can be used to compare both commercial and open-source data lake platforms.