Purpose: Echocardiographic interpretation requires video-level reasoning and guideline-based measurement analysis, which current deep learning models for cardiac ultrasound do not support. We present EchoAgent, a framework that enables structured, interpretable automation for this domain. Methods: EchoAgent orchestrates specialized vision tools under Large Language Model (LLM) control to perform temporal localization, spatial measurement, and clinical interpretation. A key contribution is a measurement-feasibility prediction model that determines whether anatomical structures are reliably measurable in each frame, enabling autonomous tool selection. We curated a benchmark of diverse, clinically validated video-query pairs for evaluation. Results: EchoAgent achieves accurate, interpretable results despite the added complexity of spatiotemporal video analysis. Outputs are grounded in visual evidence and clinical guidelines, supporting transparency and traceability. Conclusion: This work demonstrates the feasibility of agentic, guideline-aligned reasoning for echocardiographic video analysis, enabled by task-specific tools and full video-level automation. EchoAgent sets a new direction for trustworthy AI in cardiac ultrasound.
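To make the feasibility-gated tool selection concrete, the following minimal Python sketch shows one way such an agent loop could be organized. All names (`Frame`, `predict_feasibility`, `measure_structure`, `run_agent`) and the 0.5 threshold are illustrative assumptions, not EchoAgent's actual API or implementation.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch: a per-frame feasibility check gates calls to a
# spatial-measurement tool, mirroring the abstract's description of
# autonomous tool selection. None of these names come from the paper.

@dataclass
class Frame:
    index: int
    feasibility_score: float  # stand-in for a feasibility model's output

@dataclass
class Measurement:
    frame_index: int
    structure: str
    value_mm: float

def predict_feasibility(frame: Frame, structure: str) -> bool:
    """Stand-in for the measurement-feasibility prediction model:
    decide whether `structure` is reliably measurable in this frame."""
    return frame.feasibility_score >= 0.5  # assumed threshold

def measure_structure(frame: Frame, structure: str) -> Measurement:
    """Stand-in for a spatial-measurement vision tool, invoked only
    on frames judged feasible."""
    return Measurement(frame.index, structure, value_mm=42.0)  # dummy value

def run_agent(frames: List[Frame], structure: str) -> List[Measurement]:
    """Agent loop sketch: gate each tool call on feasibility so
    measurements are attempted only where they are reliable."""
    results: List[Measurement] = []
    for frame in frames:
        if predict_feasibility(frame, structure):
            results.append(measure_structure(frame, structure))
    return results

if __name__ == "__main__":
    clip = [Frame(i, s) for i, s in enumerate([0.2, 0.8, 0.9, 0.3])]
    for m in run_agent(clip, "LV internal diameter"):
        print(f"frame {m.frame_index}: {m.structure} = {m.value_mm} mm")
```

In this sketch, infeasible frames are simply skipped; an LLM controller could then interpret the collected measurements against clinical guidelines, as the framework describes.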