Visual question answering (VQA) is crucial for promoting surgical education. In practice, the needs of trainees are constantly evolving, such as learning additional types of surgery, adapting to different robotic systems, and mastering new surgical instruments and techniques for various procedures. However, patient data privacy often restricts the availability of old data when updating the model, necessitating an exemplar-free continual learning (CL) setup. Prior CL studies have overlooked two vital problems in the surgical domain: 1) large domain shifts across diverse surgical operations collected from multiple sources, and 2) severe data imbalance arising from the uneven presence of surgical instruments or activities. This paper addresses these problems with a multimodal large language model (LLM) and an adaptive weight assignment methodology. We first develop a new multi-teacher CL framework that leverages a multimodal LLM as an additional teacher; the strong generalization ability of the LLM helps bridge the knowledge gap when domain shifts and data imbalances occur. We then propose a novel data processing method that transforms complex LLM embeddings into logits compatible with our CL framework. We further design an adaptive weight assignment approach that balances the generalization ability of the LLM against the domain expertise of the old CL model. Finally, to comprehensively evaluate the effectiveness of our method, we construct two new surgical VQA datasets that differ substantially from existing ones and could serve as valuable resources for future research. Extensive experiments on these datasets demonstrate the superiority of our method over other advanced CL schemes.
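As a rough illustration of the multi-teacher distillation idea sketched above, the snippet below blends the old CL model's logits with LLM-derived logits using an adaptive weight before distilling them into the new model. This is a minimal sketch under assumptions: the function name, the convex-combination weighting via `alpha`, and the temperature-scaled KL objective are illustrative choices, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def multi_teacher_distillation_loss(student_logits, old_model_logits, llm_logits,
                                    targets, alpha, temperature=2.0, ce_weight=1.0):
    """Hypothetical multi-teacher distillation for exemplar-free CL.

    `alpha` in [0, 1] is the adaptive weight: larger values favor the LLM teacher's
    generalization, smaller values favor the old CL model's domain expertise.
    `llm_logits` are assumed to be LLM embeddings already mapped to answer-class logits.
    """
    # Blend the two teachers' softened distributions with the adaptive weight.
    teacher_probs = (alpha * F.softmax(llm_logits / temperature, dim=-1)
                     + (1.0 - alpha) * F.softmax(old_model_logits / temperature, dim=-1))
    # Soft-target distillation term (KL divergence, scaled by T^2 as is conventional).
    kd_loss = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                       teacher_probs, reduction="batchmean") * (temperature ** 2)
    # Standard cross-entropy on the current task's ground-truth answers.
    ce_loss = F.cross_entropy(student_logits, targets)
    return ce_weight * ce_loss + kd_loss
```

In this sketch, `alpha` could be set per class or per sample (e.g., higher for classes the old model saw rarely), which is one plausible way to realize the adaptive weight assignment described in the abstract.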