In audio and speech processing, tasks usually focus on either the audio or the speech modality, even when both environmental sounds and human speech are present in the same audio clip. Recent Auditory Large Language Models (ALLMs) have made it possible to process audio and speech simultaneously within a single model, prompting further investigation of joint audio-speech tasks. In this paper, we establish a new benchmark to investigate how well ALLMs can perform joint audio-speech processing. Specifically, we introduce Joint Audio-Speech Co-Reasoning (JASCO), a novel task that unifies audio and speech processing and strictly requires co-reasoning across both modalities. We also release a scene-reasoning dataset called "What Are They Doing". Additionally, we provide deeper insight into the models' behavior by analyzing their dependence on each modality.