Vector search has emerged as the foundation for large-scale information retrieval and machine learning systems. Search engines such as Google and Bing process tens of thousands of queries per second on petabyte-scale document datasets by evaluating vector similarities between encoded query texts and web documents. As performance demands on vector search systems surge, hardware acceleration offers a promising solution in the post-Moore's Law era. We introduce \textit{FANNS}, an end-to-end and scalable vector search framework on FPGAs. Given a user-provided recall requirement on a dataset and a hardware resource budget, \textit{FANNS} automatically co-designs the hardware and the algorithm, then generates the corresponding accelerator. The framework also supports scale-out by incorporating a hardware TCP/IP stack in the accelerator. \textit{FANNS} attains up to 23.0$\times$ and 37.2$\times$ speedup over FPGA and CPU baselines, respectively, and scales better than GPUs, achieving 5.5$\times$ and 7.6$\times$ speedup in median and 95\textsuperscript{th} percentile (P95) latency in an eight-accelerator configuration. The performance of \textit{FANNS} lays a robust groundwork for future FPGA integration in data centers and AI supercomputers.
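To make the notions of vector similarity search and a recall requirement concrete, the following is a minimal sketch in plain Python. It is a generic illustration, not the \textit{FANNS} algorithm or its hardware design: the function names (`l2_distance`, `exact_top_k`, `recall_at_k`), the choice of Euclidean distance, and the toy data are all assumptions made for this example. It computes the exact top-$k$ neighbors by brute force, then measures what fraction of them a hypothetical approximate search recovered.

```python
import heapq
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_top_k(query, database, k):
    """Return ids of the k database vectors closest to the query (brute force)."""
    scored = ((l2_distance(query, vec), idx) for idx, vec in enumerate(database))
    return [idx for _, idx in heapq.nsmallest(k, scored)]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k neighbors present in the approximate result."""
    if not exact_ids:
        return 0.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Toy data (hypothetical): four 2-D "document" vectors and one query vector.
database = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
query = (0.9, 0.1)

exact = exact_top_k(query, database, k=2)
approx = [1, 3]  # stand-in for the ids an approximate index might return
print(exact, recall_at_k(approx, exact))
```

A recall requirement in this sense is a lower bound on `recall_at_k`: an approximate index may trade accuracy for speed only as long as its results stay above that bound.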