Imitation learning has achieved remarkable success in robotic manipulation, yet its application to surgical robotics remains challenging due to data scarcity, constrained workspaces, and the need for exceptionally high safety and predictability. We present a supervised Mixture-of-Experts (MoE) architecture designed for phase-structured surgical manipulation tasks, which can be layered on top of any autonomous policy. Unlike prior surgical robot learning approaches that rely on multi-camera setups or thousands of demonstrations, we show that a lightweight action-decoder policy such as the Action Chunking Transformer (ACT), when equipped with our architecture, can learn complex, long-horizon manipulation from fewer than 150 demonstrations using only stereo endoscopic images. We evaluate our approach on the collaborative surgical task of bowel grasping and retraction, in which a robot assistant interprets visual cues from a human surgeon, executes targeted grasping on deformable tissue, and performs sustained retraction. We benchmark our method against state-of-the-art Vision-Language-Action (VLA) models and the standard ACT baseline. Our results show that generalist VLAs fail to acquire the task at all, even under standard in-distribution conditions. Furthermore, while standard ACT achieves moderate success in-distribution, adopting a supervised MoE architecture significantly boosts its performance, yielding higher in-distribution success rates and superior robustness in out-of-distribution scenarios, including novel grasp locations, reduced illumination, and partial occlusions. Notably, it generalizes to unseen testing viewpoints and transfers zero-shot to ex vivo porcine tissue without additional training, offering a promising pathway toward in vivo deployment. To support this, we present preliminary qualitative results of policy roll-outs during in vivo porcine surgery.
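The phase-gated design described above can be sketched minimally as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: all names are hypothetical, each "expert" stands in for a per-phase policy (e.g. an ACT decoder), and a supervised phase classifier routes each observation to exactly one expert, keeping behavior predictable within each surgical phase.

```python
# Hypothetical sketch of a supervised Mixture-of-Experts wrapper for a
# phase-structured task. The expert policies and phase classifier here are
# toy stand-ins, not the trained networks used in the paper.
from typing import Callable, Dict, List

Action = List[float]
Observation = Dict[str, float]


class PhaseGatedMoE:
    """Routes each observation to one per-phase expert via a supervised gate."""

    def __init__(
        self,
        phase_classifier: Callable[[Observation], str],
        experts: Dict[str, Callable[[Observation], Action]],
    ) -> None:
        self.phase_classifier = phase_classifier
        self.experts = experts

    def act(self, obs: Observation) -> Action:
        # Hard (top-1) routing: the supervised gate selects a single expert,
        # so only one phase-specific policy is active at any time.
        phase = self.phase_classifier(obs)
        return self.experts[phase](obs)


# Toy stand-ins for a grasp-phase and a retraction-phase expert.
def grasp_expert(obs: Observation) -> Action:
    return [obs["x"], 0.0, 1.0]   # approach target, close gripper


def retract_expert(obs: Observation) -> Action:
    return [obs["x"], -0.5, 1.0]  # hold grasp, pull tissue back


policy = PhaseGatedMoE(
    # Illustrative gate: in practice this would be a learned classifier
    # predicting the current surgical phase from stereo endoscopic images.
    phase_classifier=lambda obs: "grasp" if obs["t"] < 0.5 else "retract",
    experts={"grasp": grasp_expert, "retract": retract_expert},
)
```

Because the gate is trained with explicit phase labels rather than learned end-to-end, the routing decision is interpretable and auditable, which matters for the safety and predictability requirements the abstract emphasizes.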