LSM-tree-based data stores are widely adopted in industry for their excellent performance. As data volumes grow, disk-based join operations become indispensable yet costly, making the selection of suitable join methods crucial for system optimization. Current LSM-based stores generally follow conventional relational database practice and support only a limited number of join methods. However, the LSM-tree has different read and write efficiency characteristics from those of relational databases, which could in turn affect the performance of various join methods. It is therefore necessary to reconsider the selection of join methods in this context to fully explore the potential of different join algorithms and index designs. In this work, we present a systematic study and an exhaustive benchmark of joins over LSM-trees. We define a configuration space for join methods that encompasses various join algorithms, secondary index types, and consistency strategies. We also provide a theoretical analysis of the overhead of each join method to support an in-depth understanding. Furthermore, we implement all join methods in the configuration space on a unified platform and compare their performance through extensive experiments. Our theoretical and experimental results yield several insights and takeaways specific to joins in LSM-based stores, which can help developers choose appropriate join methods for their working conditions.