Surgical intervention is crucial to patient healthcare, and many studies have developed advanced algorithms that assist surgeons with scene understanding and decision-making. Despite this progress, each of these algorithms targets a single task and scenario, so in practice different functions must be combined manually, which limits their applicability. An intelligent and versatile surgical assistant should instead accurately understand the surgeon's intentions and carry out the corresponding tasks to support the surgical process. In this work, we leverage advanced multimodal large language models (MLLMs) to propose a Versatile Surgery Assistant (VS-Assistant) that accurately understands the surgeon's intention and completes a series of surgical understanding tasks on demand, e.g., surgical scene analysis, surgical instrument detection, and segmentation. Specifically, to achieve superior surgical multimodal understanding, we devise a mixture of projectors (MOP) module that aligns the surgical MLLM in VS-Assistant while balancing natural and surgical knowledge. Moreover, we devise a surgical Function-Calling Tuning strategy that enables VS-Assistant to understand surgical intentions and thus make a series of surgical function calls on demand to meet the surgeon's needs. Extensive experiments on neurosurgery data confirm that VS-Assistant understands the surgeon's intention more accurately than existing MLLMs, resulting in substantially stronger performance on textual analysis and visual tasks. Source code and models will be made public.
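To make the idea of on-demand surgical function calls more concrete, the following is a minimal sketch of how a model's output could be routed to task-specific functions. It is not the VS-Assistant implementation: the JSON call format, the registry, and the function names `detect_instruments` and `segment_instruments` are all assumptions introduced for illustration, and the function bodies are placeholders rather than real detectors or segmenters.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical registry mapping function names to callables.
# The names and signatures here are illustrative assumptions only.
REGISTRY: Dict[str, Callable[..., Any]] = {}


def register(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that adds a function to the registry under `name`."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        REGISTRY[name] = fn
        return fn
    return decorator


@register("detect_instruments")
def detect_instruments(frame_id: int) -> dict:
    # Placeholder: a real system would run a detection model here.
    return {"task": "detection", "frame_id": frame_id, "boxes": []}


@register("segment_instruments")
def segment_instruments(frame_id: int) -> dict:
    # Placeholder: a real system would run a segmentation model here.
    return {"task": "segmentation", "frame_id": frame_id, "masks": []}


def dispatch(model_output: str) -> Any:
    """Run the function named in a JSON call, or pass plain text through.

    Assumed call format: {"name": "<function>", "arguments": {...}}.
    Output that is not such a JSON object is treated as a free-text answer.
    """
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return model_output
    if not isinstance(call, dict) or "name" not in call:
        return model_output
    fn = REGISTRY.get(call["name"])
    if fn is None:
        raise KeyError(f"unknown function: {call['name']}")
    return fn(**call.get("arguments", {}))
```

Under these assumptions, a request such as "segment the instruments in this frame" would lead the model to emit a call naming `segment_instruments`, while a scene-analysis question would produce plain text that `dispatch` returns unchanged.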