Despite the growing success of Large Speech Language Models (LSLMs) in processing short-term acoustic signals, their extension to long-form audio understanding remains severely bottlenecked. This limitation stems from limited context length and the exorbitant memory footprint required for long-form inference. In this work, we propose Speech-XL, a new model that capitalizes on the intrinsic key-value (KV) sparsification capacity of Large Language Models (LLMs) to achieve high-ratio compression of speech inputs. Specifically, we introduce a novel special token, the Speech Summarization Token (SST), for each speech interval, which encapsulates the intra-interval speech information into its associated KV pairs. The SST module is trained via instruction fine-tuning with a curriculum learning strategy: the SST learns to compress information progressively, advancing from low-ratio (simple) to high-ratio (challenging) compression. Despite using significantly less training data than competing baselines, our model achieves highly competitive performance on major benchmarks, including LongSpeech and AUDIOMARATHON. By addressing long-standing bottlenecks in long-form audio modeling, our approach offers a novel perspective on the condensation of extensive acoustic sequences.
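To make the compression idea concrete, the following is a minimal sketch of interval-level KV condensation. It is not the paper's method: mean pooling stands in for the learned SST token, and all names, shapes, and the `interval_len` parameter are illustrative assumptions. The point it shows is that keeping only one summary KV pair per interval shrinks the cache by a factor of the interval length.

```python
import numpy as np

def compress_kv(keys, values, interval_len):
    """Condense per-frame KV pairs into one summary KV pair per interval.

    Mean pooling is a hypothetical stand-in for the learned SST token,
    which in Speech-XL would be trained to absorb intra-interval
    information into its own KV pairs.
    """
    n, _ = keys.shape
    summary_k, summary_v = [], []
    for start in range(0, n, interval_len):
        end = min(start + interval_len, n)
        summary_k.append(keys[start:end].mean(axis=0))
        summary_v.append(values[start:end].mean(axis=0))
    return np.stack(summary_k), np.stack(summary_v)

rng = np.random.default_rng(0)
keys = rng.normal(size=(1200, 64))    # 1200 speech frames, head dim 64
values = rng.normal(size=(1200, 64))

# A curriculum would raise interval_len gradually (low ratio -> high ratio).
ck, cv = compress_kv(keys, values, interval_len=40)
print(ck.shape, keys.shape[0] // ck.shape[0])  # (30, 64) 40
```

After compression, the downstream LLM attends only to the 30 summary KV pairs instead of the 1200 per-frame pairs, which is what makes long-form inference fit in memory.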