Auditory Large Language Models (LLMs) have demonstrated strong performance across a wide range of speech and audio understanding tasks. Nevertheless, they often struggle when applied to low-resource tasks. In case in-domain labeled data are scarce or mismatched with the true test distribution, direct fine-tuning can be brittle. In-Context Learning (ICL) provides a training-free, inference-time solution by adapting auditory LLMs through conditioning on a few in-domain demonstrations. In this work, we first show that $\textit{Vanilla ICL}$, improves zero-shot performance across diverse speech and audio tasks for selected models which suggest that this ICL adaptation capability can be generalized to multimodal setting. Building on this, we propose $\textbf{Meta Speech In-Context Learning (MetaSICL)}$, a post-training recipe utilizes only high resource speech data from various tasks intending to strengthen model's in-context learning capability. Experiments indicate our proposed method outperforms direct fine-tuning in low-resource scenario.
翻译:听觉大语言模型在广泛的语音与音频理解任务中展现出强大的性能。然而,当应用于低资源任务时,它们往往面临困难。若领域内标注数据稀缺或与真实测试分布不匹配,直接微调可能具有脆弱性。上下文学习提供了一种无需训练、推理时适配的解决方案,通过基于少量领域内演示示例对听觉大语言模型进行条件化调整。在本工作中,我们首先证明,$\textit{朴素上下文学习}$能够提升所选模型在多种语音与音频任务上的零样本性能,这表明这种上下文学习适配能力可推广至多模态场景。在此基础上,我们提出$\textbf{元语音上下文学习(MetaSICL)}$,一种利用来自不同任务的高资源语音数据来增强模型上下文学习能力的后训练方案。实验表明,在低资源场景下,我们提出的方法优于直接微调。