Recent speech-aware large language models (Speech-LLMs) rely on a pre-trained speech encoder to convert audio into semantically rich representations that the LLM can consume. In this work, we instead ask: can an LLM learn to read Mel spectrograms directly, without a dedicated speech encoder? We propose Mel-LLM, an encoder-free Speech-LLM that feeds lightly pre-processed Mel spectrogram patches into the LLM through a linear projection, so that the LLM learns speech-text alignment purely through its own parameters. We conduct extensive experiments on both automatic speech recognition (ASR) and text-to-speech (TTS) tasks. For ASR, we evaluate on the public test sets of the OpenASR leaderboard and in production-level scaling experiments, and show that the encoder-free solution achieves competitive performance with only limited degradation compared to encoder-initialized counterparts. We find that when data is limited, initialization from a multimodal checkpoint (Phi-4-MM) is crucial for maintaining performance. We also present ablation studies revealing which LLM layers are less relevant to speech encoding. For TTS, we show preliminary results with a next-token VAE approach. While TTS performance is not yet optimal, these results establish the feasibility of a fully unified, encoder-free architecture for autoregressive speech-text modeling.
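The input path described above (Mel spectrogram patches passed through a linear projection) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the patch length (8 frames), the number of Mel bins (80), the hidden size (64), the zero-padding of the final patch, and the helper names `patchify_mel` and `project_patches` are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def patchify_mel(mel: np.ndarray, patch_frames: int) -> np.ndarray:
    """Group consecutive Mel frames into flattened patches.

    mel: array of shape (num_frames, num_mel_bins).
    Returns an array of shape (num_patches, patch_frames * num_mel_bins).
    Trailing frames that do not fill a whole patch are zero-padded
    (an assumption; the abstract does not say how the tail is handled).
    """
    num_frames, num_bins = mel.shape
    pad = (-num_frames) % patch_frames
    if pad:
        mel = np.pad(mel, ((0, pad), (0, 0)))
    # Row-major reshape keeps the frames of each patch contiguous.
    return mel.reshape(-1, patch_frames * num_bins)


def project_patches(patches: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Apply a single linear projection: one embedding per patch."""
    return patches @ weight + bias


# Hypothetical sizes, chosen only to make the example concrete.
rng = np.random.default_rng(0)
num_frames, num_bins, patch_frames, hidden_size = 103, 80, 8, 64

mel = rng.standard_normal((num_frames, num_bins))
patches = patchify_mel(mel, patch_frames)

# In a trained model these would be learned parameters.
weight = rng.standard_normal((patch_frames * num_bins, hidden_size)) * 0.02
bias = np.zeros(hidden_size)

embeddings = project_patches(patches, weight, bias)
print(patches.shape, embeddings.shape)
```

In this sketch each row of `embeddings` plays the role of one input position for the LLM, alongside ordinary text token embeddings. No speech encoder sits between the spectrogram and the language model.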