Traditional speech systems typically rely on separate, task-specific models for text-to-speech (TTS), automatic speech recognition (ASR), and voice conversion (VC), resulting in fragmented pipelines that limit scalability, efficiency, and cross-task generalization. In this paper, we present General-Purpose Audio (GPA), a unified audio foundation model that integrates multiple core speech tasks within a single large language model (LLM) architecture. GPA operates on a shared discrete audio token space and supports instruction-driven task induction, enabling a single autoregressive model to flexibly perform TTS, ASR, and VC without architectural modifications. This unified design combines a fully autoregressive formulation over discrete speech tokens, joint multi-task training across speech domains, and a scalable inference pipeline that achieves high concurrency and throughput. The resulting model family supports efficient multi-scale deployment, including a lightweight 0.3B-parameter variant optimized for edge and resource-constrained environments. Together, these design choices demonstrate that a unified autoregressive architecture can achieve competitive performance across diverse speech tasks while remaining viable for low-latency, practical deployment.