The integration of Federated Learning (FL) and Self-supervised Learning (SSL) offers a unique and synergistic combination for exploiting audio data for general-purpose audio understanding without compromising user data privacy. However, few efforts have been made to investigate SSL models in the FL regime for general-purpose audio understanding, especially when the training data is generated by large-scale heterogeneous audio sources. In this paper, we evaluate the performance of feature-matching and predictive audio-SSL techniques when integrated into large-scale FL settings simulated with non-independent and identically distributed (non-iid) data. We propose a novel Federated SSL (F-SSL) framework, dubbed FASSL, that enables learning intermediate feature representations from large-scale decentralized heterogeneous clients holding unlabelled audio data. Our study finds that audio F-SSL approaches perform on par with centralized audio-SSL approaches on the audio-retrieval task. Extensive experiments demonstrate the effectiveness and significance of FASSL, as it assists in obtaining the optimal global model for state-of-the-art FL aggregation methods.