In surgical skill assessment, the Objective Structured Assessment of Technical Skills (OSATS) and the Global Rating Scale (GRS) are well-established tools for evaluating surgeons during training. These metrics, along with performance feedback, help surgeons improve and reach practice standards. Recent research on the open-source JIGSAWS dataset, which includes both GRS and OSATS labels, has focused on regressing GRS scores from kinematic data, video, or their combination. However, we argue that regressing GRS alone is limiting, because GRS aggregates the OSATS scores and overlooks clinically meaningful variations during a surgical trial. To address this, we developed a recurrent transformer model that takes kinematic data as input and tracks a surgeon's performance throughout a session by mapping its hidden states to six OSATS scores, trained with a clinically motivated objective function. These OSATS scores are averaged to predict GRS, allowing us to compare our model's performance against state-of-the-art (SOTA) methods. We report Spearman's Correlation Coefficients (SCC) demonstrating that our model outperforms SOTA methods that use kinematic data (SCC 0.83-0.88) and matches the performance of video-based models. Our model also surpasses SOTA in most tasks for average OSATS predictions (SCC 0.46-0.70) and for specific OSATS (SCC 0.56-0.95). Generating pseudo-labels at the segment level translates quantitative predictions into qualitative feedback, which is vital for automated surgical skill assessment pipelines. A senior surgeon validated our model's outputs, agreeing with 77% of the weakly supervised predictions (p=0.006).
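The evaluation step described above (average six predicted OSATS scores per trial into a GRS estimate, then compare against reference GRS with Spearman's correlation) can be illustrated with a minimal sketch. This is not the authors' code: the function names `average_ranks`, `spearman`, and `grs_from_osats` are hypothetical, and all scores below are made-up numbers chosen only to show the computation.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


def grs_from_osats(osats_scores):
    """Aggregate one trial's six OSATS predictions into a GRS estimate."""
    return mean(osats_scores)


# Hypothetical per-trial OSATS predictions (six items each); not real data.
predicted_osats = [
    [2, 2, 3, 2, 2, 3],
    [4, 3, 4, 4, 3, 4],
    [3, 3, 3, 2, 3, 3],
    [5, 4, 5, 5, 4, 5],
]
# Hypothetical reference GRS labels for the same four trials; not real data.
reference_grs = [14, 19, 20, 28]

predicted_grs = [grs_from_osats(trial) for trial in predicted_osats]
scc = spearman(predicted_grs, reference_grs)
print(f"SCC between predicted and reference GRS: {scc:.2f}")
```

Because Spearman's coefficient depends only on ranks, replacing the mean in `grs_from_osats` with a sum would give the same SCC, so the choice of aggregation scale does not affect this comparison.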