Today's Internet infrastructure is centered around content retrieval over HTTP, with middleboxes (e.g., HTTP proxies) playing a crucial role in performance, security, and cost-effectiveness. We envision a future where Internet communication will be dominated by "prompts" sent to generative AI models. For this, we will need proxies that provide functions similar to those of HTTP proxies (e.g., caching, routing, compression) while dealing with the unique challenges and opportunities of prompt-based communication. As a first step toward supporting prompt-based communication, we present LLMBridge, an LLM proxy designed for cost-conscious users, such as those in developing regions and education (e.g., students and instructors). LLMBridge supports three key optimizations: model selection (routing prompts to the most suitable model), context management (intelligently reducing the amount of context), and semantic caching (serving prompts using local models and vector databases). These optimizations introduce trade-offs between cost and quality, which applications navigate through a high-level, bidirectional interface. As case studies, we deploy LLMBridge in two cost-sensitive settings: a WhatsApp-based Q&A service and a university classroom environment. The WhatsApp service has been live for over twelve months, serving 100+ users and handling more than 14.7K requests. In parallel, we exposed LLMBridge to students across three computer science courses over a semester, where it supported diverse LLM-powered applications, such as reasoning agents and chatbots, and handled an average of 500 requests per day. We report on deployment experiences across both settings and use the collected workloads to benchmark the effectiveness of various cost-optimization strategies, analyzing their trade-offs in cost, latency, and response quality.
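The semantic-caching idea above can be sketched in miniature: embed each incoming prompt, search previously answered prompts for a near neighbor, and serve the cached response on a sufficiently close match. The sketch below is illustrative only, not LLMBridge's implementation; the `embed` function is a toy character-trigram embedding standing in for a real local embedding model, and a plain list stands in for a vector database. The similarity `threshold` is a hypothetical tuning knob governing the cost/quality trade-off.

```python
import math

def embed(text):
    # Toy embedding: character-trigram counts. A real deployment would
    # use a local embedding model and a vector database instead.
    vec = {}
    t = text.lower()
    for i in range(len(t) - 2):
        tri = t[i:i + 3]
        vec[tri] = vec.get(tri, 0) + 1
    return vec

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.7):
        self.threshold = threshold   # min similarity to count as a hit
        self.entries = []            # (embedding, prompt, response)

    def lookup(self, prompt):
        # Return the cached response of the most similar prompt,
        # or None if nothing clears the threshold (cache miss).
        q = embed(prompt)
        best_response, best_sim = None, 0.0
        for emb, _, response in self.entries:
            sim = cosine(q, emb)
            if sim > best_sim:
                best_response, best_sim = response, sim
        return best_response if best_sim >= self.threshold else None

    def insert(self, prompt, response):
        self.entries.append((embed(prompt), prompt, response))

cache = SemanticCache(threshold=0.7)
cache.insert("What is the capital of France?", "Paris")
hit = cache.lookup("what is the capital of france")   # near-duplicate: hit
miss = cache.lookup("Explain TCP congestion control") # unrelated: miss
```

A lower threshold serves more prompts from cache (cheaper, faster) at the risk of returning an answer to a subtly different question; this is the kind of cost/quality trade-off the abstract says applications navigate through the proxy's interface.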