Visual Question Answering (VQA) on multi-modal data supports real-life applications such as home robots and medical diagnosis. One significant challenge is to devise a robust decentralized learning framework for heterogeneous client models when centralized data collection must be avoided because of confidentiality concerns. This work tackles privacy-preserving VQA by decoupling a multi-modal model into representation modules and a contrastive module, and by leveraging inter-module gradient sharing and inter-client weight sharing. To this end, we propose Bidirectional Contrastive Split Learning (BiCSL) to train a global multi-modal model on the entire data distribution of decentralized clients. We employ a contrastive loss that enables more efficient self-supervised learning of the decentralized modules. Comprehensive experiments on the VQA-v2 dataset with five state-of-the-art (SOTA) VQA models demonstrate the effectiveness of the proposed method. Furthermore, we evaluate BiCSL's robustness against a dual-key backdoor attack on VQA. The results show that BiCSL is considerably more robust to this multi-modal adversarial attack than centralized learning, making it a promising approach to decentralized multi-modal learning.
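The abstract names two mechanisms, a contrastive loss and inter-client weight sharing, without giving their exact form. The following is a minimal sketch of what each could look like, not the BiCSL implementation. It assumes an InfoNCE-style loss over matched pairs of embeddings and a plain element-wise average of client weights. The function names (`cosine_similarity_matrix`, `info_nce`, `average_weights`) and the `temperature` parameter are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between two lists of embedding vectors."""
    def norm(v):
        # Fall back to 1.0 for an all-zero vector to avoid division by zero.
        return math.sqrt(sum(x * x for x in v)) or 1.0

    return [
        [sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v)) for v in b]
        for u in a
    ]


def info_nce(sim, temperature=0.1):
    """InfoNCE-style loss (assumed form, not taken from the paper).

    `sim` is a square similarity matrix whose diagonal entries are the
    matched (positive) pairs; all other entries act as negatives.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # cross-entropy with target index i
    return total / n


def average_weights(client_weights):
    """Element-wise mean of equally sized client weight vectors
    (a simple stand-in for inter-client weight sharing)."""
    n = len(client_weights)
    return [sum(ws) / n for ws in zip(*client_weights)]


if __name__ == "__main__":
    image_emb = [[1.0, 0.0], [0.0, 1.0]]
    text_emb = [[1.0, 0.0], [0.0, 1.0]]
    sim = cosine_similarity_matrix(image_emb, text_emb)
    print("contrastive loss:", info_nce(sim))
    print("shared weights:", average_weights([[1.0, 2.0], [3.0, 4.0]]))
```

Under these assumptions the loss is small when each embedding is most similar to its own partner and grows when partners are mismatched, which is the property a contrastive objective relies on.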