We describe a design pattern and a concrete implementation for embedding distributed approximate nearest neighbor (ANN) indexes inside the Apache Iceberg table format, using the Puffin sidecar file as the storage container and the snapshot summary as the binding mechanism. Modern analytical query engines increasingly adopt a compute-disaggregated architecture: executors are stateless, scale elastically, and read all data from object storage. Adding vector similarity search to such an engine traditionally requires a dedicated index storage layer with its own lifecycle, consistency model, and operational surface, which breaks the disaggregation invariant. We show that the Puffin format, originally introduced for table-level statistics and deletion vectors, is sufficient to carry full Vamana graphs at billion-vector scale, and that linking these blobs through the existing statistics-file snapshot summary property reduces ANN index management to standard Iceberg snapshot operations. We present a binary layout for sharded graph indexes inside Puffin; a coordinator-executor protocol for distributed index build, probe, and incremental refresh; the integration into the existing optimistic-concurrency commit path of an Iceberg REST catalog; and a tiered probe strategy that places small centroid indexes on the coordinator and large DiskANN graphs on executor SSDs. The pattern inherits atomicity, time travel, multi-engine readability, and orphan-file garbage collection from the table format at zero implementation cost. We discuss the recall/latency trade-offs introduced by the independent-shard design and quantify projected query and build performance for tables of up to 10⁹ vectors. Our implementation extends FlockDB, a distributed MPP engine built on DuckDB.
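The tiered probe and the recall/latency trade-off of independent shards can be illustrated with a minimal sketch. It is not the FlockDB implementation: all function names and the toy data are hypothetical, and a brute-force scan stands in for the per-shard DiskANN graph search. It assumes only that the coordinator ranks shards by centroid distance, that each probed shard returns a local top-k, and that the coordinator merges those lists.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Hit = Tuple[float, int]  # (distance, row id)


def l2(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pick_shards(query: Vector, centroids: Dict[int, Vector], nprobe: int) -> List[int]:
    """Coordinator tier: rank shards by centroid distance, keep the nearest nprobe."""
    ranked = sorted(centroids, key=lambda sid: l2(query, centroids[sid]))
    return ranked[:nprobe]


def probe_shard(query: Vector, rows: Dict[int, Vector], k: int) -> List[Hit]:
    """Executor tier: local top-k. Brute force stands in for a graph search."""
    return heapq.nsmallest(k, ((l2(query, vec), rid) for rid, vec in rows.items()))


def tiered_probe(
    query: Vector,
    centroids: Dict[int, Vector],
    shards: Dict[int, Dict[int, Vector]],
    k: int,
    nprobe: int,
) -> List[Hit]:
    """Probe the selected shards independently, then merge into a global top-k."""
    partials = [
        probe_shard(query, shards[sid], k)
        for sid in pick_shards(query, centroids, nprobe)
    ]
    return heapq.nsmallest(k, (hit for part in partials for hit in part))


# Toy data (hypothetical): two shards, each summarized by one centroid.
centroids = {0: (0.0, 0.0), 1: (10.0, 10.0)}
shards = {
    0: {1: (0.0, 1.0), 2: (1.0, 0.0), 3: (4.0, 4.0)},
    1: {4: (6.0, 6.0), 5: (10.0, 10.0)},
}
query = (5.2, 5.2)

one_shard = [rid for _, rid in tiered_probe(query, centroids, shards, k=2, nprobe=1)]
all_shards = [rid for _, rid in tiered_probe(query, centroids, shards, k=2, nprobe=2)]
```

With `nprobe=1` only shard 1 is searched, so row 3, the true second-nearest neighbor, is missed because it lives in shard 0. Raising `nprobe` recovers it at the cost of probing more shards, which is the trade-off the independent-shard design introduces.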