Analytical database providers (e.g., Redshift, Databricks, BigQuery) have rapidly added support for invoking Large Language Models (LLMs) through native user-defined functions (UDFs), so that users can perform natural language tasks such as classification, entity extraction, and translation inside analytical workloads. For instance, an analyst might want to extract customer sentiment from millions of product reviews. However, LLM inference is expensive in both computational and economic terms: for example, an NVIDIA L4 GPU running Llama2-7B can process only 6 KB of text per second. In this paper, we explore how to optimize LLM inference for analytical workloads that invoke LLMs within relational queries. We show that relational queries present novel opportunities for accelerating LLM inference, including reordering rows to maximize key-value (KV) cache reuse within the LLM inference engine, reordering columns within a row to further increase cache reuse, and deduplicating redundant inference requests. We implement these optimizations in Apache Spark, with vLLM as the model serving backend, and achieve up to a 4.4x improvement in end-to-end latency on a benchmark of diverse LLM-based queries on real datasets. To the best of our knowledge, this is the first work to explicitly address the problem of optimizing LLM invocations within SQL queries.
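The three optimizations named in the abstract (row reordering, column reordering, and deduplication) can be illustrated with a small sketch. This is not the paper's algorithm: it assumes a simple heuristic that places columns with fewer distinct values first, sorts the resulting field tuples so that rows sharing a prefix become adjacent, and drops exact duplicates. The helper names (`order_columns`, `plan_requests`, `shared_prefix_fields`) and the sample rows are hypothetical, and counting shared leading fields is only a rough proxy for token-level KV cache reuse.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, str]
Request = Tuple[str, ...]


def order_columns(rows: Sequence[Row], columns: Sequence[str]) -> List[str]:
    """Place columns with fewer distinct values first (assumed heuristic).

    Ties are broken by column name so the result is deterministic.
    """
    return sorted(columns, key=lambda c: (len({r[c] for r in rows}), c))


def plan_requests(
    rows: Sequence[Row], columns: Sequence[str]
) -> Tuple[List[str], List[Request]]:
    """Return the chosen column order and the deduplicated, sorted requests."""
    col_order = order_columns(rows, columns)
    # Deduplicate: identical field tuples need only one inference call.
    unique = {tuple(r[c] for c in col_order) for r in rows}
    # Sorting makes requests that share leading fields adjacent.
    return col_order, sorted(unique)


def shared_prefix_fields(requests: Sequence[Request]) -> int:
    """Count leading fields each request shares with its predecessor."""
    total = 0
    for prev, cur in zip(requests, requests[1:]):
        for a, b in zip(prev, cur):
            if a != b:
                break
            total += 1
    return total


# Hypothetical sample data: four reviews, one of them an exact duplicate.
rows = [
    {"review": "Great phone", "product": "Phone X", "category": "electronics"},
    {"review": "Battery died", "product": "Phone X", "category": "electronics"},
    {"review": "Too tight", "product": "Shoe Y", "category": "apparel"},
    {"review": "Great phone", "product": "Phone X", "category": "electronics"},
]
columns = ["review", "product", "category"]

# Baseline: original row order and column order, no deduplication.
baseline = [tuple(r[c] for c in columns) for r in rows]

col_order, requests = plan_requests(rows, columns)

print("column order:", col_order)
print("requests sent:", len(requests), "of", len(rows))
print("shared prefix fields, baseline:", shared_prefix_fields(baseline))
print("shared prefix fields, planned:", shared_prefix_fields(requests))
```

In this toy input, the planned order sends three requests instead of four and lets adjacent requests share two leading fields, where the baseline order shares none. A real system would measure reuse in tokens and would need to map results back to the original rows after deduplication.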