As conversational multimodal AI tools are increasingly adopted to process patient data for health assessment, robust benchmarks are needed to measure progress and expose failure modes under realistic conditions. Despite the importance of respiratory audio for mobile health screening, respiratory audio question answering remains underexplored: existing studies are evaluated narrowly and lack real-world heterogeneity across modalities, devices, and question types. We therefore introduce the Respiratory-Audio Question-Answering (RA-QA) benchmark, comprising a standardized data generation pipeline, a comprehensive multimodal QA collection, and a unified evaluation protocol. RA-QA harmonizes public respiratory-audio datasets into a collection of 9 million format-diverse QA pairs covering diagnostic and contextual attributes. We benchmark classical ML baselines alongside multimodal audio-language models, establishing reproducible reference points and showing how current approaches fail under heterogeneity.