LSM-tree-based data stores are widely adopted in industry for their excellent performance. As data volumes grow, disk-based join operations become indispensable yet costly, making the choice of a suitable join method crucial for system optimization. Current LSM-based stores generally follow conventional relational database practice and support only a limited number of join methods. However, the LSM-tree has read and write efficiency characteristics distinct from those of relational databases, which can in turn affect the performance of different join methods. It is therefore necessary to reconsider the selection of join methods in this context to fully explore the potential of various join algorithms and index designs. In this work, we present a systematic study and an exhaustive benchmark of joins over LSM-trees. We define a configuration space for join methods that encompasses various join algorithms, secondary index types, and consistency strategies. We also provide a theoretical analysis of the overhead of each join method for an in-depth understanding. Furthermore, we implement all join methods in the configuration space on a unified platform and compare their performance through extensive experiments. Our theoretical and experimental results yield several insights and takeaways tailored to joins in LSM-based stores, helping developers choose proper join methods based on their working conditions.
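To make the notion of a configuration space concrete, the following is a minimal sketch of how the three dimensions named in the abstract (join algorithm, secondary index type, consistency strategy) could be enumerated as a Cartesian product. The specific option names (`index_nested_loop`, `eager`, `synchronous`, and so on), the `JoinConfig` class, and the `enumerate_configs` function are illustrative assumptions and are not taken from the abstract, which does not list the concrete options.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical option lists: the abstract names the three dimensions
# but not their members, so these values are placeholders.
JOIN_ALGORITHMS = ("index_nested_loop", "sort_merge", "hash")
SECONDARY_INDEX_TYPES = ("eager", "lazy", "composite")
CONSISTENCY_STRATEGIES = ("synchronous", "validation")


@dataclass(frozen=True)
class JoinConfig:
    """One point in the (assumed) join-method configuration space."""
    algorithm: str
    index_type: str
    consistency: str


def enumerate_configs():
    """Return every combination of the three dimensions, in product order."""
    return [
        JoinConfig(algorithm, index_type, consistency)
        for algorithm, index_type, consistency in product(
            JOIN_ALGORITHMS, SECONDARY_INDEX_TYPES, CONSISTENCY_STRATEGIES
        )
    ]


if __name__ == "__main__":
    configs = enumerate_configs()
    print(f"{len(configs)} candidate join methods")
    for config in configs[:3]:
        print(config)
```

A benchmark harness could iterate over such a list and run each configuration against the same workload. In practice some combinations may be meaningless or redundant and would need to be pruned, which this sketch does not attempt.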