Hybrid queries, which combine vector nearest neighbor search with scalar predicates, pose a fundamental challenge for vector database management. Existing methods often restrict the number of vector columns involved or the complexity of the scalar predicates, which limits their flexibility across diverse query patterns. Moreover, these approaches typically do not fully exploit the correlations between scalar and vector attributes, or the distributional patterns observed in the neighborhood of a query vector. To address these limitations, we introduce BoomHQ, a learning-based framework to boost multiple hybrid queries on vector DBMSs. First, BoomHQ models the correlation between vector and scalar attributes using an autoencoder-based architecture, which also accommodates data updates. Second, BoomHQ captures prevailing query patterns, in particular through the estimated selectivity of scalar predicates within the neighborhood of a query vector. Guided by these two key features, BoomHQ predicts execution hints and rewrites the original query into an optimized version. Furthermore, we extend well-known benchmarks with vector and scalar data that have inherent correlations, to better evaluate query execution. Experimental results demonstrate that, for multiple hybrid queries at specified recall thresholds, our method achieves a 2x average and over 25x peak speedup compared to the state-of-the-art. Additionally, BoomHQ shows strong robustness against data updates and consistent optimization effectiveness across three representative vector database systems.
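To make the role of neighborhood selectivity concrete, the following is a minimal sketch and not BoomHQ's implementation. It assumes a brute-force scan in place of a vector index, a fixed threshold in place of the learned hint predictor, and two commonly used execution strategies (pre-filtering and post-filtering) as the only available hints. All names (`neighborhood_selectivity`, `choose_hint`, `hybrid_topk`, `sample_k`, `threshold`, `overfetch`) are hypothetical and chosen for illustration. The toy data ties the scalar label to the first vector coordinate, so the predicate's selectivity near a query can differ sharply from its global selectivity.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def neighborhood_selectivity(vectors, scalars, query, predicate, sample_k=50):
    """Fraction of the query's sample_k nearest rows that satisfy the predicate."""
    order = sorted(range(len(vectors)), key=lambda i: sq_dist(vectors[i], query))
    nearest = order[:sample_k]
    return sum(1 for i in nearest if predicate(scalars[i])) / len(nearest)

def choose_hint(local_selectivity, threshold=0.1):
    """Fixed-threshold stand-in for a learned predictor (assumption)."""
    # Few matches near the query: filter on the scalar first.
    # Many matches near the query: search by vector first, then filter.
    return "pre_filter" if local_selectivity < threshold else "post_filter"

def hybrid_topk(vectors, scalars, query, predicate, k, hint, overfetch=4):
    """Return indices of the k nearest rows that satisfy the predicate."""
    key = lambda i: sq_dist(vectors[i], query)
    if hint == "pre_filter":
        candidates = [i for i in range(len(vectors)) if predicate(scalars[i])]
        return sorted(candidates, key=key)[:k]
    # post_filter: widen the vector search until k matching rows are found.
    order = sorted(range(len(vectors)), key=key)
    fetch = k * overfetch
    while True:
        hits = [i for i in order[:fetch] if predicate(scalars[i])]
        if len(hits) >= k or fetch >= len(order):
            return hits[:k]
        fetch *= 2

# Toy correlated data: a 21 x 21 integer grid, labeled by the sign of x.
vectors = [(float(x), float(y)) for x in range(-10, 11) for y in range(-10, 11)]
scalars = ["red" if v[0] > 0 else "blue" for v in vectors]
is_red = lambda label: label == "red"

q_near = (8.0, 0.0)    # lies inside the "red" region
q_far = (-8.0, 0.0)    # lies inside the "blue" region

# Globally about 48% of rows are red, but the local picture is very different.
sel_near = neighborhood_selectivity(vectors, scalars, q_near, is_red)  # 1.0
sel_far = neighborhood_selectivity(vectors, scalars, q_far, is_red)    # 0.0

result_near = hybrid_topk(vectors, scalars, q_near, is_red, 5, choose_hint(sel_near))
result_far = hybrid_topk(vectors, scalars, q_far, is_red, 5, choose_hint(sel_far))
```

In this brute-force sketch both strategies return the same exact answer and differ only in how much work they do. In a real vector DBMS with an approximate index, the choice would also affect recall, which is why the abstract states speedups at specified recall thresholds.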