Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis

Deep learning-based respiratory auscultation is currently hindered by two fundamental challenges: (i) inherent information loss, since converting signals into spectrograms discards transient acoustic events and omits clinical context; and (ii) limited data availability, exacerbated by severe class imbalance. To bridge these gaps, we present Resp-Agent, an autonomous multimodal system orchestrated by a novel Active Adversarial Curriculum Agent (Thinker-A$^2$CA). Unlike static pipelines, Thinker-A$^2$CA serves as a central controller that actively identifies diagnostic weaknesses and schedules targeted synthesis in a closed loop. To address the representation gap, we introduce a modality-weaving Diagnoser that interleaves clinical text with audio tokens via strategic global attention and sparse audio anchors, capturing both long-range clinical context and millisecond-level transients. To address the data gap, we design a flow-matching Generator that adapts a text-only Large Language Model (LLM) via modality injection, decoupling pathological content from acoustic style to synthesize hard-to-diagnose samples. As a foundation for this work, we introduce Resp-229k, a benchmark corpus of 229k recordings paired with LLM-distilled clinical narratives. Extensive experiments demonstrate that Resp-Agent consistently outperforms prior approaches across diverse evaluation settings, improving diagnostic robustness under data scarcity and long-tailed class imbalance. Our code and data are available at https://github.com/zpforlove/Resp-Agent.
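
To make the closed loop described above concrete, the following is a minimal illustrative sketch of one scheduling step: given the Diagnoser's per-class recall, it splits a fixed synthesis budget across classes in proportion to how weak each class currently is. It is not the Thinker-A$^2$CA implementation. The function name `allocate_synthesis_budget`, the use of `1 - recall` as the weakness score, the proportional allocation rule, and the class names and numbers in the usage lines are all assumptions made for illustration.

```python
from typing import Dict


def allocate_synthesis_budget(recall: Dict[str, float], budget: int) -> Dict[str, int]:
    """Split a synthesis budget across classes in proportion to (1 - recall).

    Hypothetical helper: the weakness score and allocation rule are
    illustrative assumptions, not the method used by Resp-Agent.
    """
    # Weakness of a class: how far its recall is from perfect.
    weakness = {c: max(0.0, 1.0 - r) for c, r in recall.items()}
    total = sum(weakness.values())
    if total == 0 or budget <= 0:
        # Nothing to fix, or nothing to spend.
        return {c: 0 for c in recall}

    # Ideal fractional share per class, then round down.
    shares = {c: budget * w / total for c, w in weakness.items()}
    plan = {c: int(s) for c, s in shares.items()}

    # Largest-remainder rounding so the plan sums exactly to the budget.
    leftover = budget - sum(plan.values())
    by_remainder = sorted(shares, key=lambda c: shares[c] - plan[c], reverse=True)
    for c in by_remainder[:leftover]:
        plan[c] += 1
    return plan


# Usage with made-up recall values for three hypothetical classes.
example_recall = {"healthy": 0.95, "copd": 0.80, "pneumonia": 0.50}
example_plan = allocate_synthesis_budget(example_recall, budget=300)
print(example_plan)
```

In a full loop, the synthesized samples would be added to the training set, the Diagnoser retrained or fine-tuned, and the per-class recall re-measured before the next allocation. Those steps are omitted here because the abstract does not specify them.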
