Recent advancements in language models (LMs) have attracted substantial attention for their ability to generate human-like responses. Although they show promise for applications such as conversational AI, these LMs face deployment challenges on various devices due to their extreme computational cost and unpredictable inference latency. This variable inference latency, identified as a consequence of the uncertainty intrinsic to natural language, can lead to computational inefficiency and degrade the overall performance of LMs, especially under high-traffic workloads. Unfortunately, the range of these uncertainty sources is broad, which complicates predicting latency and the effects of such uncertainties. To understand and mitigate the impact of uncertainty on systems that demand real-time responses, we take the first step to comprehend, quantify, and optimize these uncertainty-induced latency variations in LMs. Specifically, we present RT-LM, an uncertainty-aware resource management ecosystem for real-time inference of LMs. RT-LM quantifies how specific input uncertainties adversely affect latency, often by increasing output length. Exploiting these insights, we devise a lightweight yet effective method to dynamically correlate input text uncertainties with output length at runtime. Using this quantification as a latency heuristic, we integrate the uncertainty information into a system-level scheduler that explores several optimization opportunities arising from uncertainty, including uncertainty-aware prioritization, dynamic consolidation, and strategic CPU offloading. Quantitative experiments across five state-of-the-art LMs on two hardware platforms demonstrate that RT-LM can significantly reduce the average response time and improve throughput while incurring only a small runtime overhead.