Real-time interaction with LLMs, as well as instant search and instant recommendation on social media, makes it urgent to build k-NN graphs or indexing graphs over massive collections of vectorized multimedia data. In such scenarios, the scale of the data or of the graph may exceed the processing capacity of a single machine. This paper addresses graph construction at this scale via efficient graph merge. For graph construction on a single node, two generic and highly parallelizable algorithms, namely Two-way Merge and Multi-way Merge, are proposed to merge subgraphs into one. For graph construction across multiple nodes, a multi-node procedure based on Two-way Merge is presented. The procedure makes it feasible to construct a large-scale k-NN graph or indexing graph on either a single node or multiple nodes when the data size exceeds the memory capacity of one node. Extensive experiments are conducted on both large-scale k-NN graph construction and indexing graph construction. For k-NN graph construction, large-scale, high-quality k-NN graphs are built by merging graphs in parallel. For instance, a billion-scale k-NN graph can be built in approximately 17 hours using only three nodes. For indexing graph construction, the merged indexing graphs achieve NN search performance similar to that of the original indexing graph while requiring much less construction time.