A Multi-Agent LLM Framework for Rating the Quality of Surgical Feedback

Rafal Kocielnik, J. Everett Knudsen, Steven Y. Cen, Jasmine Lin, Cherine H. Yang, Atharva Deo, Ujjwal Pasupulety, Peter Wager, Anima Anandkumar, Andrew J. Hung

From arXiv, 25 pages, 3 figures

Verbal feedback delivered by attending surgeons in the operating room plays a critical formative role in resident trainees' skill acquisition. Yet assessing the quality of trainer feedback and its effectiveness in influencing trainee behavior during live surgery remains a challenge. Prior studies assessed feedback content through extensive manual annotation by expert human raters and focused on developing broad taxonomies that overlook qualitative aspects of feedback delivery, such as clarity or urgency. The few existing automated methods, including keyword analysis and topic modeling, also fail to capture these nuanced aspects. We introduce a two-stage LLM-based framework that discovers interpretable feedback quality criteria grounded in the context of surgical training. Our method uses multi-agent prompting and surgical domain knowledge injection to discover a small set of human-interpretable scoring criteria (e.g., Encouraging, Urgent, Clear). These criteria are then used to automatically score live surgical feedback via an LLM-as-a-judge approach. Evaluation on 4.2k trainer feedback instances demonstrates that our AI-discovered criteria outperform prior content-based frameworks in predicting feedback effectiveness, including observed trainee behavioral adjustments and trainer approval. This work advances scalable, human-aligned assessment of communication quality in the operating room and provides a foundation for improving surgical teaching practices.
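
The second stage described above, scoring each feedback instance against a small set of criteria with an LLM acting as judge, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The prompt wording, the 1 to 5 scale, and the names build_prompt, parse_score, score_feedback, and stub_judge are all assumptions made for illustration, and only the three criteria named in the abstract are used. The judge is passed in as a plain callable so that any model client could be substituted; here a stub stands in for the LLM.

```python
from typing import Callable, Dict, List

# Criteria named as examples in the abstract; the full discovered set is not listed there.
CRITERIA: List[str] = ["Encouraging", "Urgent", "Clear"]


def build_prompt(criterion: str, feedback: str) -> str:
    """Hypothetical judge prompt; the paper's actual wording is not given in the abstract."""
    return (
        f"Rate the following surgical feedback on the criterion '{criterion}' "
        "from 1 (not at all) to 5 (very much). Reply with a single integer.\n"
        f"Feedback: {feedback}"
    )


def parse_score(reply: str, low: int = 1, high: int = 5) -> int:
    """Return the first integer found in the reply, clamped to the assumed scale."""
    for token in reply.split():
        cleaned = token.strip(".,:;")
        if cleaned.isdigit():
            return max(low, min(high, int(cleaned)))
    raise ValueError(f"no integer score found in reply: {reply!r}")


def score_feedback(feedback: str, judge: Callable[[str], str]) -> Dict[str, int]:
    """Ask the judge once per criterion and collect the parsed scores."""
    return {c: parse_score(judge(build_prompt(c, feedback))) for c in CRITERIA}


def stub_judge(prompt: str) -> str:
    """Placeholder standing in for a real LLM call; always answers with a mid-scale score."""
    return "3"


if __name__ == "__main__":
    print(score_feedback("Slow down near the vessel.", stub_judge))
```

Keeping the judge behind a callable separates prompt construction and score parsing from the model call, so the per-criterion scores can be fed to whatever downstream analysis relates them to feedback effectiveness.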
