Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts, and even cascaded pipelines, on language understanding tasks. We term this shortfall the text-speech understanding gap: the performance drop observed when a speech-adapted LLM processes spoken inputs relative to when the original text-based LLM processes the equivalent text. Recent approaches to narrowing this gap either rely on large-scale speech synthesis of text corpora, which is costly and heavily dependent on synthetic data, or on large-scale proprietary speech datasets, which are not reproducible. As a result, there remains a need for more data-efficient alternatives for closing the text-speech understanding gap. In this work, we analyze the gap as driven by two factors: (i) forgetting of text capabilities during adaptation, and (ii) cross-modal misalignment between speech and text. Based on this analysis, we introduce SALAD (Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation), which combines cross-modal distillation with targeted synthetic data to improve alignment while mitigating forgetting. Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech data from public corpora.