The information exponent (Ben Arous et al. [2021]), which for Gaussian single-index models is equivalent to the lowest degree in the Hermite expansion of the link function, has played an important role in predicting the sample complexity of online stochastic gradient descent (SGD) in various learning tasks. In this work, we demonstrate that, for multi-index models, focusing solely on the lowest degree can miss key structural details of the model and result in suboptimal rates. Specifically, we consider the task of learning target functions of the form $f_*(\mathbf{x}) = \sum_{k=1}^{P} \phi(\mathbf{v}_k^* \cdot \mathbf{x})$, where $P \ll d$, the ground-truth directions $\{ \mathbf{v}_k^* \}_{k=1}^P$ are orthonormal, and only the second and $2L$-th Hermite coefficients of the link function $\phi$ can be nonzero. According to the theory of the information exponent, when the lowest degree is $2L$, recovering the directions requires $d^{2L-1}\mathrm{poly}(P)$ samples, and when the lowest degree is $2$, only the relevant subspace (not the exact directions) can be recovered, due to the rotational invariance of the second-order terms. In contrast, we show that by considering both the second- and higher-order terms, we can first learn the relevant subspace via the second-order terms, and then the exact directions via the higher-order terms, so that the overall sample and computational complexity of online SGD is $d \,\mathrm{poly}(P)$.
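To make the setup concrete, the following is a minimal sketch (not the paper's algorithm) of the learning problem described above: a link function whose only nonzero probabilists' Hermite coefficients sit at degrees $2$ and $2L$, orthonormal ground-truth directions, and plain online SGD on the squared loss with fresh Gaussian samples. The dimensions `d`, `P`, `L`, the unit Hermite coefficients, the step size, and the per-row spherical normalization are illustrative assumptions, not values from the paper.

```python
import numpy as np
from numpy.polynomial import hermite_e as He  # probabilists' Hermite polynomials He_k

rng = np.random.default_rng(0)
d, P, L = 64, 4, 2                    # assumed ambient dim, num. directions, 2L = 4
coef = np.zeros(2 * L + 1)
coef[2], coef[2 * L] = 1.0, 1.0       # only He_2 and He_{2L} nonzero (unit coefficients assumed)

def phi(z):
    # link function phi(z) = He_2(z) + He_{2L}(z)
    return He.hermeval(z, coef)

def phi_prime(z):
    # phi'(z), needed for the SGD gradient
    return He.hermeval(z, He.hermeder(coef))

# orthonormal ground-truth directions v_1*, ..., v_P* (rows of V_star)
V_star = np.linalg.qr(rng.standard_normal((d, P)))[0].T   # shape (P, d)

def f_star(x):
    # target f_*(x) = sum_k phi(v_k* . x)
    return phi(V_star @ x).sum()

# student with the same architecture, trained by online SGD
W = rng.standard_normal((P, d)) / np.sqrt(d)
lr, n_steps = 1e-3, 10_000
for _ in range(n_steps):
    x = rng.standard_normal(d)                        # fresh Gaussian sample each step
    resid = phi(W @ x).sum() - f_star(x)              # prediction error
    W -= lr * resid * np.outer(phi_prime(W @ x), x)   # gradient of 0.5 * resid**2 w.r.t. W
    W /= np.linalg.norm(W, axis=1, keepdims=True)     # project rows back to the sphere
```

The per-step spherical projection mirrors the spherical online SGD commonly analyzed in this line of work (e.g., Ben Arous et al. [2021]); the two-stage behavior described in the abstract (subspace recovery from the degree-$2$ terms, then direction recovery from the degree-$2L$ terms) is a property of the dynamics, not an explicit switch in the loop.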