Vector search has emerged as the foundation for large-scale information retrieval and machine learning systems: search engines like Google and Bing process tens of thousands of queries per second on petabyte-scale document datasets by evaluating vector similarities between encoded query texts and web documents. As performance demands on vector search systems surge, hardware acceleration offers a promising solution in the post-Moore's Law era. We introduce \textit{FANNS}, an end-to-end, scalable vector search framework on FPGAs. Given a user-provided recall requirement on a dataset and a hardware resource budget, \textit{FANNS} automatically co-designs the hardware and the algorithm, then generates the corresponding accelerator. The framework also supports scale-out by incorporating a hardware TCP/IP stack in the accelerator. \textit{FANNS} attains up to 23.0$\times$ and 37.2$\times$ speedup compared to FPGA and CPU baselines, respectively, and demonstrates superior scalability to GPUs, achieving 5.5$\times$ and 7.6$\times$ speedup in median and 95\textsuperscript{th} percentile (P95) latency within an eight-accelerator configuration. The remarkable performance of \textit{FANNS} lays a robust groundwork for future FPGA integration in data centers and AI supercomputers.