Self-supervised learning (SSL) has recently received significant attention due to its ability to train high-performance encoders purely on unlabeled data, often scraped from the internet. This data can still be sensitive, and empirical evidence suggests that SSL encoders memorize private information from their training data and can disclose it at inference time. Since existing theoretical definitions of memorization from supervised learning rely on labels, they do not transfer to SSL. To address this gap, we propose SSLMem, a framework for defining memorization within SSL. Our definition compares the alignment of representations of data points and their augmented views between encoders that were trained on these data points and encoders that were not. Through comprehensive empirical analysis on diverse encoder architectures and datasets, we show that, even though SSL relies on large datasets and strong augmentations, both known in supervised learning as regularization techniques that reduce overfitting, significant fractions of training data points still experience high memorization. Our empirical results further show that this memorization is essential for encoders to achieve higher generalization performance on different downstream tasks.
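To make the alignment-difference idea concrete, a minimal sketch of how such a memorization score could be written is given below; the notation is illustrative and simplified, not necessarily the paper's exact formulation. For an encoder $f$ trained on a dataset containing a point $x$ and an encoder $g$ trained on the same data without $x$, one can define

\[
\mathrm{align}(f, x) \;=\; \mathbb{E}_{a, a' \sim \mathcal{A}} \, d\big(f(a(x)),\, f(a'(x))\big),
\qquad
m(x) \;=\; \mathrm{align}(g, x) \;-\; \mathrm{align}(f, x),
\]

where $\mathcal{A}$ is the augmentation distribution and $d$ is a distance between representations (for instance, an $\ell_2$ distance). A larger $m(x)$ indicates that training on $x$ tightened the alignment of its augmented views relative to an encoder that never saw $x$, i.e., that $x$ is more strongly memorized.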