Sharp Rates in Dependent Learning Theory: Avoiding Sample Size Deflation for the Square Loss

In this work, we study statistical learning with dependent ($\beta$-mixing) data and square loss in a hypothesis class $\mathscr{F}\subset L_{\Psi_p}$, where $\|\cdot\|_{\Psi_p}$ is the norm $\|f\|_{\Psi_p} \triangleq \sup_{m\geq 1} m^{-1/p} \|f\|_{L^m}$ for some $p\in [2,\infty]$. Our inquiry is motivated by the search for a sharp noise interaction term, or variance proxy, in learning with dependent data. Absent any realizability assumption, typical non-asymptotic results exhibit variance proxies that are deflated multiplicatively by the mixing time of the underlying covariates process. We show that whenever the topologies of $L^2$ and $\Psi_p$ are comparable on our hypothesis class $\mathscr{F}$ -- that is, $\mathscr{F}$ is a weakly sub-Gaussian class: $\|f\|_{\Psi_p} \lesssim \|f\|_{L^2}^\eta$ for some $\eta\in (0,1]$ -- the empirical risk minimizer achieves a rate whose leading term depends only on the complexity of the class and on second-order statistics. Our result holds whether or not the problem is realizable, and we refer to it as a \emph{near mixing-free rate}, since direct dependence on mixing is relegated to an additive higher-order term. We arrive at our result by combining the above notion of a weakly sub-Gaussian class with mixed tail generic chaining. This combination allows us to compute sharp, instance-optimal rates for a wide range of problems. Examples that satisfy our framework include sub-Gaussian linear regression, more general smoothly parameterized function classes, finite hypothesis classes, and bounded smoothness classes.
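As an illustration of the weakly sub-Gaussian condition, the sketch below evaluates the $\Psi_p$ norm for the simplest case: a linear function of Gaussian covariates, so that $f(X)\sim N(0,\sigma^2)$ and $\|f\|_{L^2}=\sigma$. It is not taken from the paper. The function names, the restriction of the supremum to integer $m$, and the truncation at `max_m` are all assumptions made for this example. Under those assumptions the ratio $\|f\|_{\Psi_2}/\|f\|_{L^2}$ is a constant that does not depend on $\sigma$, which corresponds to the condition with $p=2$ and $\eta=1$.

```python
import math


def gaussian_lm_norm(m: int, sigma: float = 1.0) -> float:
    """L^m norm of a N(0, sigma^2) variable, i.e. sigma * (E|Z|^m)^(1/m).

    Uses the closed form E|Z|^m = 2^(m/2) * Gamma((m+1)/2) / sqrt(pi),
    evaluated in log space so that large m does not overflow.
    """
    log_moment = (
        (m / 2) * math.log(2)
        + math.lgamma((m + 1) / 2)
        - 0.5 * math.log(math.pi)
    )
    return sigma * math.exp(log_moment / m)


def psi_p_norm_truncated(sigma: float, p: float = 2.0, max_m: int = 200) -> float:
    """Approximate sup_{m >= 1} m^(-1/p) * ||f||_{L^m} over integers m <= max_m.

    Hypothetical helper: truncating the supremum at max_m and restricting
    to integer m are simplifications made for this illustration.
    """
    return max(
        m ** (-1.0 / p) * gaussian_lm_norm(m, sigma)
        for m in range(1, max_m + 1)
    )


if __name__ == "__main__":
    for sigma in (0.1, 1.0, 10.0):
        ratio = psi_p_norm_truncated(sigma) / sigma  # ||f||_{Psi_2} / ||f||_{L^2}
        print(f"sigma={sigma:>5}: ratio={ratio:.4f}")
```

In this Gaussian case the truncated supremum is attained at $m=1$, giving a ratio of $\sqrt{2/\pi}\approx 0.80$ for every $\sigma$. Classes with heavier tails or with $\eta<1$ would behave differently and are not covered by this sketch.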