In this paper, we investigate the effectiveness of CDF-based learned indexes in indexed nested-loop joins over both sorted and unsorted data in external memory. Our experimental study seeks to determine whether the advantages of learned indexes observed for in-memory joins by Sabek and Kraska (VLDB 2023) extend to the external-memory setting. First, we introduce two optimizations for integrating learned indexes into external-memory joins. We then conduct an extensive evaluation of hash join, sort join, and indexed nested-loop join on real-world and simulated datasets. Furthermore, we independently assess the learned-index-based join across several dimensions: storage device type, key type, data sortedness, parallelism, constrained memory, and increasing model error. Our experiments indicate that B-trees and learned indexes exhibit largely similar performance in external-memory joins. Learned indexes offer a smaller index size and faster lookups. However, their construction time is approximately $1000\times$ higher. While learned indexes can be significantly smaller ($2\times$ to $4\times$) than the internal nodes of a B-tree index, these internal nodes constitute only 0.4% to 1% of the data size and fit in main memory in most practical scenarios. Additionally, unlike in the in-memory setting, learned indexes in external memory can prioritize faster construction over accuracy (a larger error window) without significantly affecting query performance.
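To make the terms "CDF-based learned index" and "error window" concrete, the following is a minimal in-memory sketch, not the implementation evaluated in the paper. It assumes a single linear model over the sorted keys; the names `LinearLearnedIndex`, `lookup`, and `index_nested_loop_join` are hypothetical. The model predicts a key's position, and the maximum prediction error observed at build time bounds the window that is then searched. In an external-memory setting that window would presumably correspond to a range of on-disk blocks rather than a slice of a Python list.

```python
import bisect


class LinearLearnedIndex:
    """Illustrative learned index: one linear model approximating the key CDF."""

    def __init__(self, sorted_keys):
        # Assumes sorted_keys is a non-empty, ascending list of numbers.
        self.keys = sorted_keys
        self.n = len(sorted_keys)
        self.min_key = sorted_keys[0]
        span = sorted_keys[-1] - sorted_keys[0]
        # Map the key range [min, max] linearly onto positions [0, n - 1].
        self.slope = (self.n - 1) / span if span else 0.0
        # The largest prediction error seen at build time is the error window.
        self.err = max(
            abs(self._predict(k) - i) for i, k in enumerate(sorted_keys)
        )

    def _predict(self, key):
        pos = int(self.slope * (key - self.min_key))
        # Clamp so that out-of-range probe keys still yield a valid window.
        return min(max(pos, 0), self.n - 1)

    def lookup(self, key):
        """Return the position of the first occurrence of key, or None."""
        pos = self._predict(key)
        lo = max(0, pos - self.err)
        hi = min(self.n, pos + self.err + 1)
        # Binary search restricted to the error window around the prediction.
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < self.n and self.keys[i] == key:
            return i
        return None


def index_nested_loop_join(outer_keys, index):
    """Probe the index once per outer key; emit (key, inner_position) matches."""
    matches = []
    for key in outer_keys:
        pos = index.lookup(key)
        if pos is not None:
            matches.append((key, pos))
    return matches
```

A coarser model would be cheaper to build but would yield a larger `err`, and therefore a wider window to search on every probe; this is the construction-versus-accuracy trade-off the abstract refers to.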