100x Cost & Latency Reduction: Performance Analysis of AI Query Approximation using Lightweight Proxy Models

Yeounoh Chung, Rushabh Desai, Jian He, Yu Xiao, Thibaud Hottelier, Yves-Laurent Kom Samo, Pushkar Khadilkar, Xianshun Chen, Sam Idicula, Fatma Özcan, Alon Halevy, Yannis Papakonstantinou

Several data warehouse and database providers have recently introduced extensions to SQL called AI Queries. These extensions let users specify functions and conditions in SQL that are evaluated by LLMs, which significantly broadens the kinds of queries one can express over the combination of structured and unstructured data. LLMs offer remarkable semantic reasoning capabilities, making them an essential tool for complex and nuanced queries that blend structured and unstructured data. While extremely powerful, these AI queries can become prohibitively costly when invoked thousands of times. This paper provides an extensive evaluation of a recent AI query approximation approach that enables low-cost analytics and database applications to benefit from AI queries. The approach delivers >100x cost and latency reduction for the semantic filter operator, as well as important gains for semantic ranking. The cost and performance gains come from utilizing cheap and accurate proxy models over embedding vectors. We show that despite the massive gains in latency and cost, these proxy models preserve, and occasionally improve, accuracy across various benchmark datasets, including the extended Amazon reviews benchmark that has 10M rows. We present an OLAP-friendly architecture for this approach within Google BigQuery for purely online (ad hoc) queries, and a low-latency, HTAP-database-friendly architecture in AlloyDB that could further improve latency by moving proxy model training offline. We also present techniques that accelerate proxy model training.
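To make the proxy-model idea concrete, the sketch below approximates a semantic filter: an LLM labels only a small random sample of rows, a lightweight classifier is trained on those rows' embedding vectors, and the classifier then scores every row in place of the LLM. The abstract does not state which proxy model family is used, so this is a minimal sketch assuming a logistic-regression proxy trained by plain gradient descent on precomputed embeddings. All names (`train_proxy`, `approx_semantic_filter`, `fake_llm`) and the synthetic 2-d data are hypothetical illustrations, not BigQuery or AlloyDB APIs.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = Sequence[float]


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w: List[float], x: Vector) -> float:
    # w holds one weight per embedding dimension followed by a bias term;
    # zip stops at len(x), so the bias is added separately.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1])


def train_proxy(xs: List[Vector], ys: List[int],
                epochs: int = 500, lr: float = 1.0) -> List[float]:
    # Assumed proxy: logistic regression fit by batch gradient descent on mean log loss.
    dim = len(xs[0])
    w = [0.0] * (dim + 1)
    for _ in range(epochs):
        grad = [0.0] * (dim + 1)
        for x, y in zip(xs, ys):
            err = score(w, x) - y
            for j in range(dim):
                grad[j] += err * x[j]
            grad[dim] += err
        w = [wj - lr * g / len(xs) for wj, g in zip(w, grad)]
    return w


def approx_semantic_filter(embeddings: List[Vector], llm_label: Callable[[int], int],
                           sample_size: int, threshold: float = 0.5,
                           seed: int = 0) -> List[int]:
    # 1) Call the expensive LLM only on a small random sample of rows.
    rng = random.Random(seed)
    sample = rng.sample(range(len(embeddings)), sample_size)
    ys = [llm_label(i) for i in sample]
    # 2) Fit the cheap proxy on the sampled rows' embeddings.
    w = train_proxy([embeddings[i] for i in sample], ys)
    # 3) Evaluate the predicate on every row with the proxy instead of the LLM.
    return [i for i, e in enumerate(embeddings) if score(w, e) >= threshold]


# Hypothetical demo: 1,000 rows with synthetic 2-d "embeddings". The stand-in
# LLM says a row passes the predicate iff it came from the positive cluster.
data_rng = random.Random(42)
truth = [data_rng.random() < 0.5 for _ in range(1000)]
embeddings = [((0.5 if t else -0.5) + data_rng.gauss(0, 0.2), data_rng.gauss(0, 0.2))
              for t in truth]

calls = [0]  # counts simulated LLM invocations


def fake_llm(i: int) -> int:
    calls[0] += 1
    return int(truth[i])


kept = approx_semantic_filter(embeddings, fake_llm, sample_size=50)
kept_set = set(kept)
accuracy = sum((i in kept_set) == truth[i] for i in range(len(truth))) / len(truth)
print(calls[0], round(accuracy, 3))
```

In this toy setup the LLM is invoked 50 times instead of 1,000, and every other row is decided by a dot product and a sigmoid; that asymmetry is the source of the savings the abstract describes. Real embeddings would be higher-dimensional and the sample size, proxy model, and threshold would need tuning.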

翻译：近年，多家数据仓库与数据库供应商在SQL中引入了名为AI查询的扩展功能，允许用户通过大语言模型(LLM)评估的SQL函数和条件来指定操作，从而显著扩展了在结构化与非结构化数据组合上可表达的查询类型。LLM展现出卓越的语义推理能力，使其成为处理融合结构化与非结构化数据的复杂精细化查询的关键工具。尽管功能极为强大，但这类AI查询在调用数千次时可能产生高昂成本。本文对近期提出的一种AI查询近似方法进行了全面评估，该方法使低成本分析与数据库应用能够受益于AI查询。该技术为语义过滤算子实现了超过100倍的延迟与成本削减，同时在语义排序环节也取得了重要改进。其成本与性能提升源于利用嵌入向量上廉价且精准的代理模型。研究表明，尽管在延迟与成本上获得显著增益，这些代理模型仍能保持准确性，甚至在某些基准数据集（包括包含1000万行的扩展版亚马逊评论基准）上提升了准确率。我们在Google BigQuery中为纯在线（即席）查询设计了一种兼容OLAP的系统架构，并在AlloyDB中构建了具备低延迟特性的HTAP数据库友好架构——通过将代理模型训练迁移至离线模式，可进一步优化延迟。此外，本文还提出了加速代理模型训练的技术方案。