Multimodal foundation models have demonstrated impressive generalization capabilities, yet efficiently adapting them to new tasks in a few-shot setting remains a critical challenge. In this work, we investigate the few-shot adaptation of Large Audio-Language Models (ALMs) through both training-based and training-free approaches. We introduce MUKA, a multi-kernel adaptation framework that combines the fine-grained, context-dependent representations of instruction-tuned models such as Pengi with the global semantic representations of contrastively pretrained models such as CLAP. By constructing a product kernel that aligns local similarity with global semantics, MUKA enhances representational power while preserving the theoretical guarantees of kernel methods and avoiding any additional training. Extensive experiments across 11 diverse audio datasets show that MUKA achieves state-of-the-art performance among training-free methods and even surpasses training-based adapters in several scenarios, offering a compelling balance between adaptability and efficiency.
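As a rough illustration of the product-kernel construction summarized above, the sketch below builds two cosine-similarity Gram matrices, one over fine-grained (Pengi-style) embeddings and one over global (CLAP-style) embeddings, multiplies them elementwise, and uses the result for training-free few-shot scoring. The function names, the nearest-class-mean scoring rule, and the random placeholder embeddings are illustrative assumptions, not MUKA's exact formulation.

```python
import numpy as np

def cosine_kernel(A, B):
    # Gram matrix of row-normalized embeddings; entries lie in [-1, 1]
    # and the matrix is positive semi-definite (PSD).
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def product_kernel(q_local, s_local, q_global, s_global):
    # Elementwise product of two PSD kernels: a query-support pair scores
    # highly only when it agrees under BOTH the local and the global view.
    return cosine_kernel(q_local, s_local) * cosine_kernel(q_global, s_global)

def predict(K, support_labels, n_classes):
    # Training-free few-shot rule (an assumption here): average each
    # query's kernel values per support class and take the argmax.
    scores = np.stack(
        [K[:, support_labels == c].mean(axis=1) for c in range(n_classes)],
        axis=1,
    )
    return scores.argmax(axis=1)

# Toy usage with random stand-ins for the two embedding spaces.
rng = np.random.default_rng(0)
q_loc, s_loc = rng.normal(size=(5, 64)), rng.normal(size=(20, 64))  # local view
q_glb, s_glb = rng.normal(size=(5, 32)), rng.normal(size=(20, 32))  # global view
labels = rng.integers(0, 4, size=20)                                # 4-way support set
K = product_kernel(q_loc, s_loc, q_glb, s_glb)                      # shape (5, 20)
print(predict(K, labels, n_classes=4))
```

One relevant property: the elementwise (Schur) product of two PSD Gram matrices is itself PSD, so the combined similarity remains a valid kernel, which is plausibly the sense in which the "theoretical guarantees of kernel methods" are preserved.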