While Large Audio-Language Models (LALMs) have been shown to exhibit degraded instruction-following capabilities, their ability to infer task patterns from in-context examples under audio conditioning remains unstudied. To address this gap, we present ALICE, a three-stage framework that progressively reduces textual guidance to systematically evaluate LALMs' in-context learning ability under audio conditioning. Evaluating six LALMs on four audio understanding tasks under two output constraint categories, we uncover an asymmetry that is consistent across all stages and LALMs: in-context demonstrations reliably improve format compliance but fail to improve, and often degrade, core task performance. This suggests that LALMs can glean surface-level formatting patterns from demonstrations but may struggle to leverage cross-modal semantic grounding to reliably infer task objectives from audio-conditioned examples, highlighting potential limitations in current cross-modal integration.