Current benchmarks for the audio modality of multimodal large language models concentrate on testing individual audio tasks, such as speaker diarization or gender identification, in isolation. They cannot verify whether a multimodal model can answer questions that require reasoning to combine audio tasks of different categories. To address this gap, we propose Audio Reasoning Tasks (ART), a new benchmark for assessing the ability of multimodal models to solve problems that require reasoning over audio signals.