We study the data-generating mechanism of reconstructive self-supervised learning (SSL) to shed light on its effectiveness. Given an infinite amount of labeled samples, we provide a necessary and sufficient condition for perfect linear approximation. The condition reveals a full-rank component that preserves the label classes of $Y$, along with a redundant component. Motivated by this condition, we propose approximating the redundant component with a low-rank factorization and measure the approximation quality with a new quantity $\epsilon_s$, parameterized by the factorization rank $s$. We incorporate $\epsilon_s$ into the excess risk analysis under both linear regression and ridge regression settings; the ridge regularization handles scenarios in which the dimension of the learned features is much larger than the number of labeled samples $n$ available for the downstream task. We design three stylized experiments that compare SSL with supervised learning under different settings to support our theoretical findings.
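As a rough illustration of the two ingredients named in the abstract, the sketch below computes the error of a rank-$s$ approximation of a matrix and fits ridge regression when the feature dimension exceeds the sample count. It is not the paper's construction: the relative Frobenius error used here is only a stand-in assumption for $\epsilon_s$, whose exact definition the abstract does not give, and the names `low_rank_relative_error`, `ridge_fit`, `redundant`, and `features` are hypothetical, as is the synthetic data.

```python
import numpy as np


def low_rank_relative_error(matrix, s):
    """Relative Frobenius error of the best rank-s approximation.

    By the Eckart-Young theorem the best rank-s approximation is the
    truncated SVD, so the error depends only on the discarded singular
    values. This is an assumed proxy, not the paper's epsilon_s.
    """
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    tail = np.sqrt(np.sum(singular_values[s:] ** 2))
    total = np.sqrt(np.sum(singular_values ** 2))
    return tail / total


def ridge_fit(features, y, lam):
    """Closed-form ridge estimator (F^T F + lam * I)^{-1} F^T y."""
    d = features.shape[1]
    gram = features.T @ features + lam * np.eye(d)
    return np.linalg.solve(gram, features.T @ y)


rng = np.random.default_rng(0)

# Hypothetical "redundant component": a rank-3 signal plus small noise.
d = 50
redundant = rng.normal(size=(d, 3)) @ rng.normal(size=(3, d))
redundant += 0.01 * rng.normal(size=(d, d))
errors = [low_rank_relative_error(redundant, s) for s in (1, 2, 3, 4)]

# Downstream task with more feature dimensions than labeled samples (d > n),
# where the unregularized normal equations are singular but ridge is not.
n = 20
features = rng.normal(size=(n, d))
beta_true = rng.normal(size=d)
y = features @ beta_true + 0.1 * rng.normal(size=n)
beta_hat = ridge_fit(features, y, lam=1.0)
```

In this toy setup the error drops sharply once $s$ reaches the true rank of the signal, which is the kind of behavior a rank-parameterized quality measure is meant to capture; the ridge penalty keeps the linear system solvable even though $d > n$.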