Contrastive learning is a highly successful technique for learning representations of data from labeled tuples that specify the distance relations within each tuple. We study the sample complexity of contrastive learning, i.e. the minimum number of labeled tuples sufficient for achieving high generalization accuracy. We give tight bounds on the sample complexity in a variety of settings, focusing on arbitrary distance functions, general $\ell_p$-distances, and tree metrics. Our main result is an (almost) optimal bound on the sample complexity of learning $\ell_p$-distances for integer $p$. For any $p \ge 1$ we show that $\tilde \Theta(\min(nd,n^2))$ labeled tuples are necessary and sufficient for learning $d$-dimensional representations of $n$-point datasets. Our results hold for an arbitrary distribution of the input samples and are based on giving the corresponding bounds on the Vapnik-Chervonenkis/Natarajan dimension of the associated problems. We further show that the theoretical bounds on sample complexity obtained via VC/Natarajan dimension can have strong predictive power for experimental results, in contrast with the folklore belief that there is a substantial gap between statistical learning theory and the practice of deep learning.
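To make the quantities above concrete, the following minimal Python sketch shows how a single labeled triplet could be derived from $\ell_p$-distances, and how the leading term $\min(nd, n^2)$ behaves for given $n$ and $d$. The function names `triplet_label` and `leading_term` are hypothetical, the triplet form is only one possible kind of labeled tuple, and the sketch drops the logarithmic factors hidden by $\tilde\Theta$ as well as any constants or accuracy parameters, none of which are stated here.

```python
from typing import Sequence


def triplet_label(anchor: Sequence[float], a: Sequence[float],
                  b: Sequence[float], p: int = 2) -> int:
    """Return 0 if `a` is at least as close to `anchor` as `b` under the
    l_p distance, else 1. Illustrative labeling rule (an assumption)."""
    # Comparing p-th powers of the distances gives the same ordering as
    # comparing the distances themselves, so the p-th root is skipped.
    dist_a = sum(abs(x - y) ** p for x, y in zip(anchor, a))
    dist_b = sum(abs(x - y) ** p for x, y in zip(anchor, b))
    return 0 if dist_a <= dist_b else 1


def leading_term(n: int, d: int) -> int:
    """Leading term min(n*d, n^2) of the bound, without polylog factors."""
    return min(n * d, n * n)


if __name__ == "__main__":
    # One labeled triplet in the plane under the Euclidean (p = 2) distance.
    print(triplet_label((0.0, 0.0), (1.0, 0.0), (3.0, 4.0)))
    # Low-dimensional regime (d < n): the n*d term is the smaller one.
    print(leading_term(1000, 16))
    # High-dimensional regime (d >= n): the n^2 term takes over.
    print(leading_term(10, 50))
```

The two regimes in `leading_term` reflect the minimum in the bound: once $d \ge n$, increasing the embedding dimension no longer increases the leading term.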