Vision-Language Navigation (VLN) requires large-scale trajectory-instruction data collected from private indoor environments, raising significant privacy concerns. Federated Learning (FL) mitigates this by keeping data on-device, but vanilla FL struggles under VLN's extreme cross-client heterogeneity in environments and instruction styles, making a single global model suboptimal. This paper proposes pFedNavi, a structure-aware and dynamically adaptive personalized federated learning framework tailored for VLN. Our key idea is to personalize where it matters: pFedNavi adaptively identifies client-specific layers via layer-wise mixing coefficients, and performs fine-grained parameter fusion on the selected components (e.g., the encoder-decoder projection and environment-sensitive decoder layers) to balance global knowledge sharing with local specialization. We evaluate pFedNavi on two standard VLN benchmarks, R2R and RxR, using both ResNet and CLIP visual representations. Across all metrics, pFedNavi consistently outperforms the FedAvg-based VLN baseline, achieving up to 7.5% improvement in navigation success rate and up to 7.8% gain in trajectory fidelity, while converging 1.38× faster under non-IID conditions.
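The abstract describes personalization via layer-wise mixing coefficients that blend global and client-local parameters. Below is a minimal sketch of that idea, assuming PyTorch-style state dicts; the names `personalize_layers` and `mix_logits` are illustrative placeholders, not the paper's actual API.

```python
# Minimal sketch (assumption, not the paper's implementation):
# per-layer blending of global and local weights via learnable mixing coefficients.
import torch


def personalize_layers(global_state, local_state, mix_logits):
    """Blend global and local parameters layer by layer.

    mix_logits maps each parameter name to a learnable scalar; a sigmoid
    turns it into a mixing coefficient alpha in (0, 1). alpha near 1 keeps
    the client-specific (local) weights, alpha near 0 follows the shared
    global model.
    """
    mixed = {}
    for name, g_param in global_state.items():
        alpha = torch.sigmoid(mix_logits[name])
        mixed[name] = alpha * local_state[name] + (1.0 - alpha) * g_param
    return mixed


# Toy usage: two hypothetical layers standing in for the projection and a decoder layer.
global_state = {"proj.weight": torch.zeros(4, 4), "decoder.0.weight": torch.zeros(4, 4)}
local_state = {"proj.weight": torch.ones(4, 4), "decoder.0.weight": torch.ones(4, 4)}
mix_logits = {name: torch.tensor(0.0, requires_grad=True) for name in global_state}

mixed = personalize_layers(global_state, local_state, mix_logits)
print({k: v.mean().item() for k, v in mixed.items()})  # 0.5 everywhere at initialization
```

In this sketch the mixing coefficients are ordinary learnable scalars, so each client can drive environment-sensitive layers toward its local weights while keeping broadly useful layers close to the aggregated global model.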