Visual Question Answering (VQA) based on multi-modal data supports real-world applications such as home robots and medical diagnosis. A significant challenge is to devise a robust decentralized learning framework for diverse client models when confidentiality concerns preclude centralized data collection. This work tackles privacy-preserving VQA by decoupling a multi-modal model into representation modules and a contrastive module, and by leveraging inter-module gradient sharing and inter-client weight sharing. To this end, we propose Bidirectional Contrastive Split Learning (BiCSL), which trains a global multi-modal model on the entire data distribution of decentralized clients. We employ a contrastive loss that enables more efficient self-supervised learning of the decentralized modules. Comprehensive experiments on the VQA-v2 dataset with five state-of-the-art (SOTA) VQA models demonstrate the effectiveness of the proposed method. Furthermore, we examine BiCSL's robustness against a dual-key backdoor attack on VQA. The results show that BiCSL is considerably more robust to this multi-modal adversarial attack than centralized learning, making it a promising approach to decentralized multi-modal learning.
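The two ingredients named in the abstract, a contrastive loss over paired modality embeddings and inter-client weight sharing, can be illustrated with a minimal sketch. The abstract does not specify the exact loss or aggregation rule, so everything here is an assumption: a symmetric InfoNCE-style loss stands in for the contrastive objective, plain element-wise averaging stands in for weight sharing, and the names `info_nce`, `average_weights`, and `temperature` are hypothetical rather than taken from BiCSL.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(image_embs, question_embs, temperature=0.1):
    """Symmetric InfoNCE-style loss (an assumed stand-in for the paper's loss).

    Pair i of (image_embs[i], question_embs[i]) is the positive; every other
    pairing in the batch acts as a negative.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    qs = [l2_normalize(v) for v in question_embs]
    n = len(imgs)
    # sim[i][j]: temperature-scaled cosine similarity of image i and question j
    sim = [
        [sum(a * b for a, b in zip(imgs[i], qs[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def mean_cross_entropy(rows):
        # Cross-entropy with the diagonal entry as the target class,
        # using a max-shifted log-sum-exp for numerical stability.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / len(rows)

    cols = [list(c) for c in zip(*sim)]  # question-to-image direction
    return 0.5 * (mean_cross_entropy(sim) + mean_cross_entropy(cols))


def average_weights(client_weights):
    """Element-wise mean of flat per-client weight vectors (assumed aggregation)."""
    n = len(client_weights)
    return [sum(ws) / n for ws in zip(*client_weights)]
```

In this sketch, correctly paired embeddings yield a loss near zero while swapped pairs yield a large loss, which is the signal a contrastive module would backpropagate to the representation modules. In a real split-learning setup the embeddings would come from trainable per-client networks rather than fixed lists.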