In this work, we study statistical learning with dependent ($\beta$-mixing) data and square loss in a hypothesis class $\mathscr{F}\subset L_{\Psi_p}$, where $\Psi_p$ is the norm $\|f\|_{\Psi_p} \triangleq \sup_{m\geq 1} m^{-1/p} \|f\|_{L^m}$ for some $p\in [2,\infty]$. Our inquiry is motivated by the search for a sharp noise interaction term, or variance proxy, in learning with dependent data. Absent any realizability assumption, typical non-asymptotic results exhibit variance proxies that are deflated multiplicatively by the mixing time of the underlying covariate process. We show that whenever the topologies of $L^2$ and $\Psi_p$ are comparable on our hypothesis class $\mathscr{F}$ -- that is, $\mathscr{F}$ is a weakly sub-Gaussian class: $\|f\|_{\Psi_p} \lesssim \|f\|_{L^2}^\eta$ for some $\eta\in (0,1]$ -- the empirical risk minimizer achieves a rate whose leading term depends only on the complexity of the class and second-order statistics. Our result holds whether or not the problem is realizable, and we refer to this as a \emph{near mixing-free rate}, since direct dependence on mixing is relegated to an additive higher-order term. We arrive at our result by combining the above notion of a weakly sub-Gaussian class with mixed-tail generic chaining. This combination allows us to compute sharp, instance-optimal rates for a wide range of problems. Examples that satisfy our framework include sub-Gaussian linear regression, more general smoothly parameterized function classes, finite hypothesis classes, and bounded smoothness classes.
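As a brief illustration of the weakly sub-Gaussian condition (a sketch added here for concreteness; the Gaussian linear setting is one of the examples named above), take $p=2$ and the linear class $f_w(x) = \langle w, x\rangle$ with $x \sim \mathsf{N}(0,\Sigma)$. Then $f_w(x)$ is Gaussian with variance $w^\top \Sigma w$, so for every $m \ge 1$,
\begin{equation*}
\|f_w\|_{L^m} = \|f_w\|_{L^2}\, \|Z\|_{L^m} \lesssim \sqrt{m}\, \|f_w\|_{L^2}, \qquad Z \sim \mathsf{N}(0,1),
\end{equation*}
using the standard Gaussian moment bound $\|Z\|_{L^m} \lesssim \sqrt{m}$. Dividing by $m^{1/2}$ and taking the supremum over $m \ge 1$ yields
\begin{equation*}
\|f_w\|_{\Psi_2} = \sup_{m\geq 1} m^{-1/2} \|f_w\|_{L^m} \lesssim \|f_w\|_{L^2},
\end{equation*}
so the class is weakly sub-Gaussian with exponent $\eta = 1$.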