Large Audio-Language Models (LALMs) are often constrained by short audio context windows, even when their text backbones support long contexts, which limits long-form audio understanding. Prior work has introduced context-extension methods (e.g., YaRN) for unimodal LLMs, yet their application to LALMs remains unexplored. First, building on RoPE-based context extension, we introduce Partial YaRN, a training-free, modality-decoupled extension method that modifies only audio token positions while leaving text positions intact, preserving the base LLM's text capabilities. Second, we propose Virtual Longform Audio Training (VLAT), a training strategy that extends Partial YaRN into a training-time positional augmentation. VLAT simulates diverse audio lengths during training, enabling generalization to inputs far longer than those seen in training. Our experiments on SALMONN and Qwen2-Audio confirm that Partial YaRN outperforms the original models across a wide range of settings, and that VLAT provides substantial performance improvements on long audio of unseen lengths.
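To make the modality-decoupled idea concrete, below is a minimal sketch under a position-interpolation reading of YaRN: audio token positions are compressed by the extension factor so a long audio span fits the positional range seen in training, while text tokens keep unit spacing. The function name `partial_yarn_positions` and the fractional-step scheme are illustrative assumptions, not the paper's implementation; full YaRN additionally rescales RoPE frequency bands and applies attention-temperature scaling.

```python
import numpy as np

def partial_yarn_positions(modality_mask, scale):
    """Assign RoPE position ids to a mixed audio/text sequence (illustrative).

    Audio token positions advance by 1/scale (YaRN-style interpolation),
    so a long audio span occupies the positional range the model saw in
    training; text tokens keep unit spacing, leaving the base LLM's text
    positional behavior untouched.

    modality_mask: 1D bool array, True where the token is audio.
    scale: context-extension factor (>1 compresses audio positions).
    """
    positions = np.empty(len(modality_mask), dtype=np.float64)
    pos = 0.0
    for i, is_audio in enumerate(modality_mask):
        positions[i] = pos
        pos += (1.0 / scale) if is_audio else 1.0
    return positions

# Example: 8 audio tokens followed by 4 text tokens, 2x extension.
mask = np.array([True] * 8 + [False] * 4)
print(partial_yarn_positions(mask, scale=2.0))
# Audio positions advance by 0.5; text tokens resume unit steps afterward.
```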
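VLAT's training-time augmentation can be sketched in the same terms: each training example draws a random virtual extension factor, so a short audio clip is positionally encoded as if it were part of a much longer recording. The sampling range and the name `vlat_positions` below are hypothetical; the sketch only illustrates the positional-augmentation idea.

```python
import numpy as np

def vlat_positions(modality_mask, rng, max_scale=8.0):
    """Training-time positional augmentation (VLAT-style sketch).

    Draws a random virtual extension factor per example and compresses
    audio position increments by that factor (as in the Partial YaRN
    sketch above); text tokens keep unit spacing. At inference, the
    model has thus already seen the compressed position spacings that
    genuinely long audio would induce.
    """
    scale = rng.uniform(1.0, max_scale)               # virtual length factor
    steps = np.where(modality_mask, 1.0 / scale, 1.0)  # per-token increments
    return np.concatenate([[0.0], np.cumsum(steps)[:-1]])

rng = np.random.default_rng(0)
mask = np.array([True] * 6 + [False] * 3)
print(vlat_positions(mask, rng))
```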