Does Learning from Decentralized Non-IID Unlabeled Data Benefit from Self Supervision?

Decentralized learning has been advocated and widely deployed to make efficient use of distributed datasets, with an extensive focus on supervised learning (SL) problems. Unfortunately, the majority of real-world data are unlabeled and can be highly heterogeneous across sources. In this work, we carefully study decentralized learning with unlabeled data through the lens of self-supervised learning (SSL), specifically contrastive visual representation learning. We study the effectiveness of a range of contrastive learning algorithms under decentralized learning settings, on relatively large-scale datasets including ImageNet-100, MS-COCO, and a new real-world robotic warehouse dataset. Our experiments show that the decentralized SSL (Dec-SSL) approach is robust to the heterogeneity of decentralized datasets, and learns useful representations for object classification, detection, and segmentation tasks. This robustness makes it possible to significantly reduce communication and the participation ratio of data sources with only minimal drops in performance. Interestingly, using the same amount of data, the representation learned by Dec-SSL can not only perform on par with that learned by centralized SSL, which incurs communication and excessive data storage costs, but also sometimes outperform representations extracted from decentralized SL, which requires extra knowledge about the data labels. Finally, we provide theoretical insights into why data heterogeneity is less of a concern for Dec-SSL objectives, and introduce feature alignment and clustering techniques to develop a new Dec-SSL algorithm that further improves performance in the face of highly non-IID data. Our study presents positive evidence to embrace unlabeled data in decentralized learning, and we hope to provide new insights into whether and why decentralized SSL is effective.
