Self-supervised learning (SSL) has recently received significant attention due to its ability to train high-performance encoders purely on unlabeled data, often scraped from the internet. This data can still be sensitive, and empirical evidence suggests that SSL encoders memorize private information from their training data and can disclose it at inference time. Since existing theoretical definitions of memorization from supervised learning rely on labels, they do not transfer to SSL. To address this gap, we propose SSLMem, a framework for defining memorization within SSL. Our definition compares the alignment of representations for data points and their augmented views under encoders that were trained on these data points against the alignment under encoders that were not. Through comprehensive empirical analysis on diverse encoder architectures and datasets, we highlight that even though SSL relies on large datasets and strong augmentations, both known in supervised learning as regularization techniques that reduce overfitting, significant fractions of training data points still experience high memorization. Our empirical results further show that this memorization is essential for encoders to achieve higher generalization performance on different downstream tasks.
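The comparison described in the definition can be illustrated with a minimal sketch. It assumes alignment is measured as the mean pairwise Euclidean distance between the representations of a data point's augmented views, and that the score is the difference between the alignment under an encoder that did not see the point and the alignment under one that did. The distance metric, the absence of normalization, and the names `alignment_loss`, `memorization_score`, `encoder_f`, and `encoder_g` are illustrative assumptions, not the exact formulation of SSLMem; the two encoders are toy stand-ins, not trained models.

```python
import numpy as np


def alignment_loss(encoder, views):
    """Mean pairwise Euclidean distance between the representations
    of several augmented views of one data point (lower = better aligned)."""
    reps = np.stack([encoder(v) for v in views])
    n = len(reps)
    dists = [
        np.linalg.norm(reps[i] - reps[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]
    return float(np.mean(dists))


def memorization_score(encoder_f, encoder_g, views):
    """Assumed score: alignment under g (point NOT in its training set)
    minus alignment under f (point in its training set).
    A larger value means f aligns the views much better than g does."""
    return alignment_loss(encoder_g, views) - alignment_loss(encoder_f, views)


# Toy data: one point x and five "augmented views" (small bounded perturbations).
rng = np.random.default_rng(0)
x = np.ones(4)
views = [x + rng.uniform(-0.1, 0.1, size=4) for _ in range(5)]

# Hypothetical stand-ins for trained encoders:
# encoder_f collapses all views of x to the same representation,
# mimicking an encoder that has fit this point very tightly.
encoder_f = lambda v: np.round(v)
# encoder_g leaves the perturbations untouched,
# mimicking an encoder that never saw this point.
encoder_g = lambda v: v

score = memorization_score(encoder_f, encoder_g, views)
print(f"memorization score for x: {score:.4f}")
```

In this toy setup the score is positive because `encoder_f` maps every view of `x` to an identical representation while `encoder_g` does not. In practice both encoders would be trained with the same SSL procedure and would differ only in whether the data point was part of the training set.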