Sharp Rates in Dependent Learning Theory: Avoiding Sample Size Deflation for the Square Loss

In this work, we study statistical learning with dependent ($\beta$-mixing) data and square loss in a hypothesis class $\mathscr{F}\subset L_{\Psi_p}$, where $\Psi_p$ denotes the norm $\|f\|_{\Psi_p} \triangleq \sup_{m\geq 1} m^{-1/p} \|f\|_{L^m}$ for some $p\in [2,\infty]$. Our inquiry is motivated by the search for a sharp noise interaction term, or variance proxy, in learning with dependent data. Absent any realizability assumption, typical non-asymptotic results exhibit variance proxies that are deflated \emph{multiplicatively} by the mixing time of the underlying covariate process. We show that whenever the topologies of $L^2$ and $\Psi_p$ are comparable on our hypothesis class $\mathscr{F}$ -- that is, $\mathscr{F}$ is a weakly sub-Gaussian class: $\|f\|_{\Psi_p} \lesssim \|f\|_{L^2}^\eta$ for some $\eta\in (0,1]$ -- the empirical risk minimizer achieves a rate whose leading term depends only on the complexity of the class and second-order statistics. Our result holds whether or not the problem is realizable, and we refer to it as a \emph{near mixing-free rate}, since direct dependence on mixing is relegated to an additive higher-order term. We arrive at our result by combining the above notion of a weakly sub-Gaussian class with mixed tail generic chaining. This combination allows us to compute sharp, instance-optimal rates for a wide range of problems. Examples that satisfy our framework include sub-Gaussian linear regression, more general smoothly parameterized function classes, finite hypothesis classes, and bounded smoothness classes.
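
As a quick sanity check of the weakly sub-Gaussian condition, consider the following illustrative sketch. The setting is an assumption made here for concreteness and is not taken from the text above: centered Gaussian covariates $X\sim N(0,\Sigma)$ and the linear class $f_\theta(x)=\langle\theta,x\rangle$. Since $f_\theta(X)\sim N(0,\theta^\top\Sigma\theta)$, writing $g\sim N(0,1)$ and using the standard moment bound $(\mathbf{E}|g|^m)^{1/m}\leq\sqrt{m}$ for $m\geq 1$ gives
\[
\|f_\theta\|_{L^m} = \|f_\theta\|_{L^2}\,\bigl(\mathbf{E}|g|^m\bigr)^{1/m} \leq \sqrt{m}\,\|f_\theta\|_{L^2},
\qquad\text{so}\qquad
\|f_\theta\|_{\Psi_2} = \sup_{m\geq 1} m^{-1/2}\,\|f_\theta\|_{L^m} \leq \|f_\theta\|_{L^2}.
\]
Under these assumptions the condition $\|f\|_{\Psi_p}\lesssim\|f\|_{L^2}^\eta$ holds with $p=2$ and $\eta=1$. At the other extreme, for $p=\infty$ the weight $m^{-1/p}$ equals $1$, so $\|f\|_{\Psi_\infty}=\sup_{m\geq 1}\|f\|_{L^m}=\|f\|_{L^\infty}$ and the $\Psi_\infty$ norm reduces to the essential supremum.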
