The dominant paradigm for end-to-end robot learning optimizes task-specific objectives that solve a single robotic problem, such as picking up an object or reaching a target position. However, recent work on high-capacity models in robotics has shown promise in training on large, diverse, task-agnostic datasets of video demonstrations. These models have shown impressive generalization to unseen circumstances, especially as the amount of data and the model complexity scale. Surgical robot systems that learn from data have struggled to advance as quickly as other fields of robot learning for three reasons: (1) there is a lack of large-scale open-source data for training models, (2) the soft-body deformations that these robots encounter during surgery are challenging to model because simulation cannot match the physical and visual complexity of biological tissue, and (3) surgical robots risk harming patients when tested in clinical trials and therefore require more extensive safety measures. This perspective article aims to provide a path toward increasing robot autonomy in robot-assisted surgery through the development of a multi-modal, multi-task, vision-language-action model for surgical robots. Ultimately, we argue that surgical robots are uniquely positioned to benefit from general-purpose models, and we provide three guiding actions toward increased autonomy in robot-assisted surgery.