Large audio-language models (LALMs) increasingly rely on explicit reasoning traces for complex audio understanding, yet the quality of that reasoning is rarely evaluated. Process-level benchmarks for process reward models (PRMs) have advanced reasoning evaluation in the text and multimodal domains, but comparable evaluation for audio reasoning remains limited. In this paper, we present AudioProcessBench, a comprehensive benchmark for step-level process error identification in audio reasoning. AudioProcessBench contains diverse reasoning traces generated by six audio and omni language models. Each trace is segmented into discrete reasoning steps and annotated with binary step correctness and fine-grained error types. The benchmark evaluates verifier models under three complementary paradigms: (1) step correctness identification; (2) error-type-conditioned detection, which diagnoses audio-specific verifier capabilities; and (3) chain-level aggregation, in which verifiers select or aggregate among multiple reasoning traces for the same question. This design enables a systematic analysis of three questions: whether current models can detect process errors, whether their weaknesses differ across audio-specific error types, and whether process verification translates into improved answer selection. AudioProcessBench provides a testbed for future research on audio reasoning verifiers, process reward models, and reliable omni-modal reasoning.
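To make the step-level and chain-level paradigms concrete, the following is a minimal sketch under stated assumptions, not the benchmark's actual implementation. The names `Step`, `step_accuracy`, and `select_trace` are hypothetical, as is the `"temporal"` error-type label in the usage note. Plain per-step accuracy and minimum-score aggregation are also assumptions chosen for illustration; the abstract does not specify which metrics or aggregation rules are used.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Step:
    """One annotated reasoning step (hypothetical schema)."""
    text: str                         # the reasoning step itself
    is_correct: bool                  # binary step-correctness label
    error_type: Optional[str] = None  # fine-grained error type; None if correct


def step_accuracy(gold: Sequence[Step], predicted: Sequence[bool]) -> float:
    """Fraction of steps where the verifier's verdict matches the label."""
    if len(gold) != len(predicted):
        raise ValueError("one prediction per step is required")
    if not gold:
        raise ValueError("trace has no steps")
    hits = sum(step.is_correct == verdict for step, verdict in zip(gold, predicted))
    return hits / len(gold)


def select_trace(step_scores: Sequence[Sequence[float]]) -> int:
    """Chain-level selection (assumed rule): score each trace by its
    weakest step, then return the index of the highest-scoring trace."""
    if not step_scores or any(len(scores) == 0 for scores in step_scores):
        raise ValueError("need at least one non-empty trace")
    chain_scores = [min(scores) for scores in step_scores]
    return max(range(len(chain_scores)), key=chain_scores.__getitem__)
```

For example, with `trace = [Step("A dog barks in the clip.", True), Step("The barking lasts ten seconds.", False, "temporal")]`, a verifier that marks both steps as correct gets `step_accuracy(trace, [True, True]) == 0.5`. Given verifier scores for two candidate traces, `select_trace([[0.9, 0.2], [0.7, 0.6]])` returns `1`, because the second trace has the stronger weakest step.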