RAMoEA-QA: Hierarchical Specialization for Robust Respiratory Audio Question Answering

Conversational generative AI is increasingly explored in healthcare, where models must integrate heterogeneous patient signals and support diverse interaction styles while producing clinically meaningful outputs. In respiratory care, non-invasive audio recordings captured with sensing devices offer a scalable route to screening and longitudinal monitoring, but heterogeneity is particularly acute: recordings vary across devices, environments, and acquisition protocols, and queries vary in intent, answer format, and prediction objective. Biomedical audio-language question answering (QA) systems for respiratory assessment are starting to emerge, but they are typically built as single-path models that process all inputs through the same acoustic and language pathway regardless of recording conditions and query type. They are also usually evaluated in relatively limited settings, which leaves open their robustness under realistic distribution shifts, including changes in acquisition domain, modality, and clinical task. To address this gap, we introduce RAMoEA-QA, the first respiratory audio question answering (RA QA) model designed to support input-dependent specialization across heterogeneous recordings and query types within a unified hierarchical two-stage framework. We study this design in a unified RA QA setting spanning clinical and self-recorded audio, multi-device acquisition settings, multiple question formats, and both discrete and continuous targets. Across in-domain and controlled-shift evaluations, RAMoEA-QA improves over matched monolithic baselines and routing controls, reaching an in-domain test accuracy of 0.72 on discriminative tasks (vs. 0.61 and 0.67 for single-path baselines), while also achieving the best regression performance and stronger average transfer under dataset, modality, and task shifts, including gains of up to 23 percentage points in accuracy in the COPD modality-shift setting.
