Vector search has emerged as the foundation for large-scale information retrieval and machine learning systems: search engines such as Google and Bing process tens of thousands of queries per second on petabyte-scale document datasets by evaluating vector similarities between encoded query texts and web documents. As performance demands on vector search systems surge, hardware acceleration offers a promising solution in the post-Moore's Law era. We introduce \textit{FANNS}, an end-to-end and scalable vector search framework on FPGAs. Given a user-provided recall requirement on a dataset and a hardware resource budget, \textit{FANNS} automatically co-designs the hardware and the algorithm and then generates the corresponding accelerator. The framework also supports scale-out by incorporating a hardware TCP/IP stack in the accelerator. \textit{FANNS} attains up to 23.0$\times$ and 37.2$\times$ speedup compared to FPGA and CPU baselines, respectively, and scales better than GPUs, achieving 5.5$\times$ and 7.6$\times$ speedup in median and 95\textsuperscript{th} percentile (P95) latency, respectively, in an eight-accelerator configuration. The performance of \textit{FANNS} lays a robust groundwork for future FPGA integration in data centers and AI supercomputers.