Contrastive Representation Learning (CRL) has achieved strong empirical success across many areas of machine learning, yet its theoretical sample complexity remains poorly understood. Existing analyses usually assume that input tuples are independent and identically distributed (i.i.d.), an assumption violated in most practical settings, where contrastive tuples are constructed from a finite pool of labeled data and are therefore mutually dependent. One recent work analyzed this setting using U-statistics to estimate the population risk, but its techniques require the risk of each class to concentrate uniformly, so the resulting excess risk bounds scale on the order of $\rho_{\min}^{-1/2}$, where $\rho_{\min}$ denotes the probability of the rarest class. Such a dependency can be overly pessimistic in extreme multiclass settings, where many tail classes contribute minimally to the overall population risk. Our contributions are twofold. First, we improve upon the previous work and prove a bound with a sample complexity of the same order as the number of classes $R$, regardless of the distribution over classes. Second, we formulate a different estimator that captures the concentration of the risk \textit{across classes}, enabling sharper bounds in extreme multiclass learning scenarios, especially when class distributions are long-tailed. Under mild assumptions on the class distributions, the resulting sample complexity is $\mathcal{O}(k)$, where $k$ is the number of samples per tuple.
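To make the dependence among tuples concrete, the following is a minimal sketch, not the construction analyzed in the paper. It assumes a tuple consists of an anchor, a positive from the same class, and a fixed number of negatives from other classes; the names `build_tuples`, `reuse_counts`, and `num_negatives` are hypothetical and chosen only for illustration. It enumerates all such tuples from a tiny labeled pool and counts how often each sample is reused.

```python
import itertools
from collections import Counter


def build_tuples(labels, num_negatives):
    """Enumerate (anchor, positive, negatives) tuples from a finite labeled pool.

    Assumed tuple format (illustrative only): anchor and positive share a class,
    and the negatives are drawn from all other classes.
    """
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    tuples = []
    for y, members in by_class.items():
        negatives_pool = [i for i, lab in enumerate(labels) if lab != y]
        # Ordered (anchor, positive) pairs within the class.
        for anchor, positive in itertools.permutations(members, 2):
            # Unordered sets of negatives from the remaining classes.
            for negs in itertools.combinations(negatives_pool, num_negatives):
                tuples.append((anchor, positive, negs))
    return tuples


def reuse_counts(tuples):
    """Count in how many tuples each sample index appears."""
    counts = Counter()
    for anchor, positive, negs in tuples:
        counts[anchor] += 1
        counts[positive] += 1
        counts.update(negs)
    return counts


# Toy pool: three samples of class 0, two of class 1, one of class 2.
labels = [0, 0, 0, 1, 1, 2]
tuples = build_tuples(labels, num_negatives=2)
counts = reuse_counts(tuples)

print(f"{len(labels)} samples yield {len(tuples)} tuples")
for idx in sorted(counts):
    print(f"sample {idx} (class {labels[idx]}) appears in {counts[idx]} tuples")
```

In this toy pool every sample appears in many tuples, so the tuples share data and cannot be treated as independent draws. The singleton class 2 also produces no anchor-positive pair at all under this assumed format, which hints at why rare classes are awkward to handle when each class is required to concentrate on its own.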