Deploying modern Speech Language Models (SpeechLMs) in streaming settings requires systems that provide low latency, high throughput, and strong guarantees of streamability. Existing systems fall short of supporting diverse models flexibly and efficiently. We present VoxServe, a unified serving system for SpeechLMs that optimizes streaming performance. VoxServe introduces a model-execution abstraction that decouples model architecture from system-level optimizations, thereby enabling support for diverse SpeechLM architectures within a single framework. Building on this abstraction, VoxServe implements streaming-aware scheduling and an asynchronous inference pipeline to improve end-to-end efficiency. Evaluations across multiple modern SpeechLMs show that VoxServe achieves 10-20x higher throughput than existing implementations at comparable latency while maintaining high streamability. The VoxServe code is available at https://github.com/vox-serve/vox-serve.