Optimizing LLM Queries in Relational Workloads

Analytical database providers (e.g., Redshift, Databricks, BigQuery) have rapidly added support for invoking Large Language Models (LLMs) through native user-defined functions (UDFs) to help users perform natural language tasks, such as classification, entity extraction, and translation, inside analytical workloads. For instance, an analyst might want to extract customer sentiment from millions of product reviews. However, LLM inference is highly expensive in both computational and economic terms: for example, an NVIDIA L4 GPU running Llama2-7B can only process 6 KB of text per second. In this paper, we explore how to optimize LLM inference for analytical workloads that invoke LLMs within relational queries. We show that relational queries present novel opportunities for accelerating LLM inference, including reordering rows to maximize key-value (KV) cache reuse within the LLM inference engine, reordering columns within a row to further increase cache reuse, and deduplicating redundant inference requests. We implement these optimizations in Apache Spark, with vLLM as the model serving backend, and achieve up to a 4.4x improvement in end-to-end latency on a benchmark of diverse LLM-based queries on real datasets. To the best of our knowledge, this is the first work to explicitly address the problem of optimizing LLM invocations within SQL queries.
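
The three optimizations named in the abstract (column reordering, row reordering, and request deduplication) can be illustrated with a small sketch. This is not the paper's algorithm or its Spark/vLLM implementation. It is a simplified heuristic under stated assumptions: columns with fewer distinct values are placed first so that more rows share a prompt prefix, prompts are sorted lexicographically so that rows sharing a prefix are issued back to back, and identical prompts are sent only once. The function names (`order_columns`, `build_prompt`, `plan_requests`, `run_query`) and the `llm` callable are hypothetical stand-ins, not part of any library.

```python
def order_columns(rows, columns):
    """Heuristic (assumption): put low-cardinality columns first so that
    more rows share a long common prompt prefix."""
    distinct = {c: len({row[c] for row in rows}) for c in columns}
    # Ties keep the original column order.
    return sorted(columns, key=lambda c: (distinct[c], columns.index(c)))


def build_prompt(instruction, row, columns):
    """Serialize one row into a prompt: shared instruction, then fields."""
    fields = "\n".join(f"{c}: {row[c]}" for c in columns)
    return f"{instruction}\n{fields}"


def plan_requests(instruction, rows, columns):
    """Return the chosen column order, the per-row prompts, and the
    deduplicated prompts in the order they should be issued."""
    cols = order_columns(rows, columns)
    prompts = [build_prompt(instruction, row, cols) for row in rows]
    # set() removes duplicate requests; sorting places prompts with a
    # common prefix next to each other, which favors prefix (KV) cache hits.
    unique = sorted(set(prompts))
    return cols, prompts, unique


def run_query(instruction, rows, columns, llm):
    """Call the hypothetical `llm` once per unique prompt, then map the
    answers back to the original row order."""
    _, prompts, unique = plan_requests(instruction, rows, columns)
    answers = {p: llm(p) for p in unique}
    return [answers[p] for p in prompts]
```

In this sketch the per-row results are returned in the original row order, so the reordering stays invisible to the rest of the query. Whether a serving backend actually reuses cached prefixes across consecutive requests depends on its configuration, which the sketch does not model.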
