Large language models (LLMs) have revolutionized the state of the art across many natural language processing tasks. Although serving LLMs is computationally and memory intensive, the rise of Small Language Models (SLMs) offers new opportunities for resource-constrained users, who are now able to serve small models with cutting-edge performance. In this paper, we present a set of experiments designed to benchmark SLM inference in terms of both performance and energy consumption. Our analysis provides a new perspective on serving, highlighting that the small memory footprint of SLMs makes it possible to reach the Pareto-optimal throughput within the resource capacity of a single accelerator. In this regard, we present an initial set of findings demonstrating how model replication can effectively improve resource utilization when serving SLMs.