PrivFedTalk: Privacy-Aware Federated Diffusion with Identity-Stable Adapters for Personalized Talking-Head Generation

Talking-head generation has advanced rapidly with diffusion-based generative models, but training usually depends on centralized face-video and speech datasets, which raises major privacy concerns. The problem is more acute for personalized talking-head generation, where identity-specific data are highly sensitive and often cannot be pooled across users or devices. This paper presents PrivFedTalk, a privacy-aware federated framework for personalized talking-head generation that combines conditional latent diffusion with parameter-efficient identity adaptation. A shared diffusion backbone is trained across clients, while each client learns lightweight LoRA identity adapters from local private audio-visual data, which avoids raw data sharing and reduces communication cost. To address heterogeneous client distributions, Identity-Stable Federated Aggregation (ISFA) weights client updates using privacy-safe scalar reliability signals computed from on-device identity consistency and temporal stability estimates. Temporal-Denoising Consistency (TDC) regularization is introduced to reduce inter-frame drift, flicker, and identity drift during federated denoising. To limit update-side privacy risk, secure aggregation and client-level differential privacy are applied to adapter updates. The implementation supports both low-memory GPU execution and multi-GPU client-parallel training on heterogeneous shared hardware. Comparative experiments in the present setup, covering multiple training and aggregation conditions with PrivFedTalk, FedAvg, and FedProx, show stable federated optimization and successful end-to-end training and evaluation under constrained resources. The results support the feasibility of privacy-aware personalized talking-head training in federated environments, while indicating that stronger component-wise, privacy-utility, and qualitative claims require further standardized evaluation.
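
The abstract describes ISFA and the client-level differential privacy step only at a high level, so the following is a minimal illustrative sketch rather than the authors' method. It assumes that each client reports a flattened adapter update together with two scalar scores in [0, 1], that reliability is the product of those scores, and that privacy is approximated by L2 clipping plus Gaussian noise on the aggregate. The function names and the `max_norm` and `noise_std` parameters are hypothetical, and secure aggregation is not modeled.

```python
import math
import random


def clip_update(update, max_norm):
    """Scale an update vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    if norm <= max_norm or norm == 0.0:
        return list(update)
    scale = max_norm / norm
    return [x * scale for x in update]


def reliability_weights(identity_scores, stability_scores):
    """Map per-client scalar scores to normalized aggregation weights.

    Assumption (not from the paper): reliability is the product of the
    identity-consistency score and the temporal-stability score.
    """
    raw = [i * s for i, s in zip(identity_scores, stability_scores)]
    total = sum(raw)
    if total == 0.0:
        # No client is considered reliable: fall back to uniform weights.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]


def aggregate(updates, weights, max_norm, noise_std, rng):
    """Weighted average of clipped updates with Gaussian noise added.

    This is plain server-side summation; a real deployment would compute
    the sum under a secure aggregation protocol and calibrate noise_std
    with a privacy accountant. Neither is shown here.
    """
    clipped = [clip_update(u, max_norm) for u in updates]
    dim = len(clipped[0])
    agg = [0.0] * dim
    for w, u in zip(weights, clipped):
        for j in range(dim):
            agg[j] += w * u[j]
    return [a + rng.gauss(0.0, noise_std) for a in agg]


if __name__ == "__main__":
    # Three hypothetical clients with 2-dimensional adapter updates.
    updates = [[3.0, 4.0], [0.3, 0.4], [-0.6, 0.8]]
    weights = reliability_weights([0.9, 0.8, 0.2], [0.9, 0.5, 0.5])
    delta = aggregate(updates, weights, max_norm=1.0, noise_std=0.01,
                      rng=random.Random(0))
    print(weights, delta)
```

In this sketch, clients with low identity consistency or temporal stability contribute less to the shared adapter update, and clipping bounds the influence of any single client before noise is added. Setting all scores equal reduces the weighting to a uniform average over clients.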
