ALICE: A Multifaceted Evaluation Framework of Large Audio-Language Models' In-Context Learning Ability

While Large Audio-Language Models (LALMs) have been shown to exhibit degraded instruction-following capabilities, their ability to infer task patterns from in-context examples under audio conditioning remains unstudied. To address this gap, we present ALICE, a three-stage framework that progressively reduces textual guidance to systematically evaluate LALMs' in-context learning ability under audio conditioning. Evaluating six LALMs across four audio understanding tasks under two output constraint categories, we uncover a consistent asymmetry across all stages and LALMs: in-context demonstrations reliably improve format compliance but fail to improve, and often degrade, core task performance. This suggests that LALMs can glean surface-level formatting patterns from demonstrations but may struggle to leverage cross-modal semantic grounding to reliably infer task objectives from audio-conditioned examples, highlighting potential limitations in current cross-modal integration.

