RAMoEA-QA: Hierarchical Specialization for Robust Respiratory Audio Question Answering

Conversational generative AI is rapidly entering healthcare, where general-purpose models must integrate heterogeneous patient signals and support diverse interaction styles while producing clinically meaningful outputs. In respiratory care, non-invasive audio, such as recordings captured via mobile microphones, enables scalable screening and longitudinal monitoring, but the heterogeneity challenge is particularly acute: recordings vary widely across devices, environments, and acquisition protocols, and questions span multiple intents and formats. Existing biomedical audio-language QA systems are typically monolithic, with no specialization mechanism for handling diverse respiratory corpora and query intents. They have also been validated only in limited settings, leaving it unclear how reliably they handle the distribution shifts encountered in real-world deployment. To address these limitations, we introduce RAMoEA-QA, a hierarchically routed generative model for respiratory audio question answering that unifies multiple question types and supports both discrete and continuous targets within a single multimodal system. RAMoEA-QA applies two-stage conditional specialization: an Audio Mixture-of-Experts routes each recording to a suitable pre-trained audio encoder, and a Language Mixture-of-Adapters selects a LoRA adapter on a shared frozen LLM to match the query intent and answer format. By specializing both acoustic representations and generation behaviour per example, RAMoEA-QA consistently outperforms strong baselines and routing ablations with minimal parameter overhead, improving in-domain test accuracy to 0.72 (vs. 0.61 and 0.67 for state-of-the-art baselines) and exhibiting the strongest generalization for diagnosis under domain, modality, and task shifts.
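The two-stage routing described above can be illustrated with a minimal sketch. It is a toy illustration of the idea and does not reproduce the RAMoEA-QA implementation: the abstract does not specify how the routers are parameterized or trained, so hard top-1 selection over softmax scores is an assumption, and every name here (`route_example`, the stub encoders and adapters, the score values) is hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of router logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top1(scores: Sequence[float]) -> int:
    """Index of the highest-probability expert (assumed hard top-1 routing)."""
    probs = softmax(scores)
    return max(range(len(probs)), key=probs.__getitem__)


def route_example(
    audio_scores: Sequence[float],    # hypothetical audio-router logits
    adapter_scores: Sequence[float],  # hypothetical adapter-router logits
    audio_encoders: Sequence[Callable],
    adapters: Sequence[Callable],
    audio: Sequence[float],
    question: str,
) -> Tuple[int, int, str]:
    # Stage 1: pick one audio encoder for this recording.
    enc_idx = top1(audio_scores)
    audio_repr = audio_encoders[enc_idx](audio)
    # Stage 2: pick one adapter on the shared language model for this query.
    ada_idx = top1(adapter_scores)
    answer = adapters[ada_idx](audio_repr, question)
    return enc_idx, ada_idx, answer


# Stub experts standing in for pre-trained encoders and LoRA adapters.
encoders = [lambda a: ("enc0", len(a)), lambda a: ("enc1", len(a))]
adapters = [
    lambda r, q: f"adapter0:{r[0]}",
    lambda r, q: f"adapter1:{r[0]}",
]

result = route_example([0.1, 2.0], [1.5, -0.3], encoders, adapters, [0.0] * 4, "q")
print(result)
```

In this sketch only the selected encoder and the selected adapter run for a given example, which is one way per-example specialization can keep parameter and compute overhead small; whether the actual system uses hard or soft selection is not stated in the abstract.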
