We study the data-generating mechanism of reconstructive self-supervised learning (SSL) to shed light on its effectiveness. Given an infinite number of labeled samples, we provide a necessary and sufficient condition for perfect linear approximation. The condition reveals a full-rank component that preserves the label classes of Y, along with a redundant component. Motivated by this condition, we propose approximating the redundant component by a low-rank factorization and measuring the approximation quality with a new quantity $\epsilon_s$, parameterized by the factorization rank s. We incorporate $\epsilon_s$ into the excess risk analysis under both linear regression and ridge regression settings; the regularization in the latter handles scenarios where the dimension of the learned features is much larger than the number of labeled samples n available for downstream tasks. We design three stylized experiments that compare SSL with supervised learning under different settings to support our theoretical findings.
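To make the idea of a rank-parameterized approximation error concrete, the following minimal sketch computes the relative Frobenius error of the best rank-s approximation of a matrix using a truncated SVD. It is an illustration only: the abstract does not define $\epsilon_s$, so the function `rank_s_relative_error`, the synthetic matrix `M`, and the choice of the Frobenius norm are all assumptions made for this example and should not be read as the paper's definition.

```python
import numpy as np

def rank_s_relative_error(M, s):
    """Relative Frobenius error of the best rank-s approximation of M.

    Hypothetical stand-in for a rank-parameterized approximation error;
    not the paper's definition of epsilon_s.
    """
    U, sing, Vt = np.linalg.svd(M, full_matrices=False)
    # Keep only the top-s singular triplets (truncated SVD).
    M_s = (U[:, :s] * sing[:s]) @ Vt[:s, :]
    return np.linalg.norm(M - M_s) / np.linalg.norm(M)

rng = np.random.default_rng(0)
# Assumed toy "redundant component": a rank-3 matrix plus small noise.
low_rank = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 40))
M = low_rank + 0.01 * rng.normal(size=(50, 40))

# Error as a function of the factorization rank s = 1, ..., 6.
errors = [rank_s_relative_error(M, s) for s in range(1, 7)]
for s, err in enumerate(errors, start=1):
    print(f"s={s}: relative error {err:.4f}")
```

In this toy setting the error drops sharply once s reaches the underlying rank of 3 and then decreases only slowly, which is the kind of trade-off a rank-parameterized error quantity is meant to capture.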