Vector databases have emerged as key enablers for bridging intelligent applications with unstructured data, providing generic search and management support for embedding vectors extracted from the raw unstructured data. As multiple data users can share the same database infrastructure, multi-tenancy support for vector databases is increasingly desirable. This hinges on an efficient filtered search operation, i.e., querying only the vectors accessible to a particular tenant. Multi-tenancy in vector databases is currently achieved by building either a single index shared among all tenants or a separate index per tenant. The former optimizes for memory efficiency at the expense of search performance, while the latter does the opposite. This paper instead presents Curator, an in-memory vector index design tailored for multi-tenant queries that simultaneously achieves two conflicting goals: low memory overhead and high performance for queries, vector insertions, and deletions. Curator indexes each tenant's vectors with a tenant-specific clustering tree and encodes these trees compactly as sub-trees of a shared clustering tree. Each tenant's clustering tree adapts dynamically to its unique vector distribution, while maintaining a low per-tenant memory footprint. Our evaluation, based on two widely used data sets, confirms that Curator delivers search performance on par with per-tenant indexing, while keeping memory consumption at the same level as metadata filtering on a single, shared index.
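To make the trade-off between the two existing approaches concrete, the following is a minimal sketch of both baselines, assuming brute-force L2 search as a stand-in for a real approximate nearest-neighbor index. It is not Curator's algorithm; the class names `SharedFilteredIndex` and `PerTenantIndex` and their methods are hypothetical and chosen only for illustration. The shared variant stores each vector once and checks tenant access at query time, so a query must consider vectors the tenant cannot see. The per-tenant variant searches only the tenant's own vectors but keeps one entry per (tenant, vector) pair, so vectors shared by several tenants are indexed repeatedly.

```python
import math
from collections import defaultdict


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class SharedFilteredIndex:
    """Hypothetical baseline: one shared store plus per-vector access metadata."""

    def __init__(self):
        self.vectors = {}                # vector id -> vector (stored once)
        self.access = defaultdict(set)   # vector id -> tenants allowed to see it

    def insert(self, vid, vec, tenant):
        self.vectors[vid] = vec
        self.access[vid].add(tenant)

    def search(self, query, tenant, k):
        # Every stored vector is visited; inaccessible ones are filtered out.
        hits = [
            (l2(query, vec), vid)
            for vid, vec in self.vectors.items()
            if tenant in self.access[vid]
        ]
        return [vid for _, vid in sorted(hits)[:k]]

    def stored_entries(self):
        return len(self.vectors)


class PerTenantIndex:
    """Hypothetical baseline: an independent store for each tenant."""

    def __init__(self):
        self.by_tenant = defaultdict(dict)  # tenant -> {vector id -> vector}

    def insert(self, vid, vec, tenant):
        # A vector shared by n tenants ends up with n index entries.
        self.by_tenant[tenant][vid] = vec

    def search(self, query, tenant, k):
        # Only the tenant's own vectors are visited.
        own = self.by_tenant.get(tenant, {})
        hits = [(l2(query, vec), vid) for vid, vec in own.items()]
        return [vid for _, vid in sorted(hits)[:k]]

    def stored_entries(self):
        return sum(len(vectors) for vectors in self.by_tenant.values())


# Toy data (made up for illustration): vector 1 is shared by tenants "a" and "b".
shared, per_tenant = SharedFilteredIndex(), PerTenantIndex()
for vid, vec, tenant in [
    (1, (0.0, 0.0), "a"),
    (1, (0.0, 0.0), "b"),
    (2, (1.0, 0.0), "a"),
    (3, (0.1, 0.0), "b"),
]:
    shared.insert(vid, vec, tenant)
    per_tenant.insert(vid, vec, tenant)

result_shared = shared.search((0.0, 0.0), "a", k=2)
result_per_tenant = per_tenant.search((0.0, 0.0), "a", k=2)
```

On this toy data both variants return the same neighbors for a tenant, while the shared store holds three entries and the per-tenant store holds four. According to the abstract, Curator aims for the memory behavior of the first variant and the search behavior of the second by encoding tenant-specific clustering trees as sub-trees of one shared tree; that encoding is not reproduced in this sketch.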