Similarity join, a widely used operation in data science, finds all pairs of items whose distance is below a given threshold. Prior work has explored distributed computation to scale similarity join to large data volumes, but these methods require a cluster deployment and their efficiency suffers from expensive inter-machine communication. On the other hand, disk-based solutions are more cost-effective: they use a single machine and store the large dataset on high-performance external storage such as NVMe SSDs, but their disk I/O time is a serious bottleneck. In this paper, we propose DiskJoin, the first disk-based similarity join algorithm that can process billion-scale vector datasets efficiently on a single machine. DiskJoin improves disk I/O by tailoring data access patterns to avoid repetitive accesses and read amplification. It also uses main memory as a dynamic cache, carefully managing cache eviction to improve the cache hit rate and reduce disk retrieval time. For further acceleration, it adopts a probabilistic pruning technique that effectively prunes a large number of vector pairs from computation. Our evaluation on real-world, large-scale datasets shows that DiskJoin significantly outperforms alternatives, achieving speedups of 50x to 1000x.
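To make the problem statement concrete, the following is a minimal brute-force sketch of a similarity join under the Euclidean metric. It is a hypothetical O(n^2) baseline for illustration only, not DiskJoin's algorithm; the function name and interface are our own.

```python
from itertools import combinations
from math import dist

def similarity_join(vectors, threshold):
    """Naive similarity join: return all index pairs (i, j) with i < j
    whose Euclidean distance is below `threshold`.

    Hypothetical brute-force baseline; it compares every pair, which is
    exactly the quadratic cost that scalable methods must avoid.
    """
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(vectors), 2)
        if dist(a, b) < threshold
    ]

# Example: three 2-D vectors; only the first two lie within distance 1.0.
pairs = similarity_join([(0.0, 0.0), (0.5, 0.0), (3.0, 4.0)], 1.0)
print(pairs)  # → [(0, 1)]
```

At billion-vector scale, the pairwise comparisons and the repeated reads of `vectors` from external storage are what make the naive approach infeasible, motivating the I/O-aware access patterns, caching, and probabilistic pruning described above.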