It is well known that the class of rotation-invariant algorithms is suboptimal even for learning sparse linear problems when the number of examples is below the "dimension" of the problem. This class includes any gradient-descent-trained neural net with a fully connected input layer (initialized with a rotationally symmetric distribution). The simplest sparse problem is learning a single feature out of $d$ features. In that case the classification error or regression loss grows with $1-k/d$, where $k$ is the number of examples seen. These lower bounds become vacuous when the number of examples $k$ reaches the dimension $d$. We show that when noise is added to this sparse linear problem, rotation-invariant algorithms are still suboptimal after seeing $d$ or more examples. We prove this via a lower bound for the Bayes optimal algorithm on a rotationally symmetrized problem. We then prove much lower upper bounds on the same problem for simple non-rotation-invariant algorithms. Finally, we analyze the gradient flow trajectories of many standard optimization algorithms in some simple cases and show how they veer toward or away from the sparse targets. We believe that our trajectory categorization will be useful in designing algorithms that can exploit sparse targets, and that our method for proving lower bounds will be crucial for analyzing other families of algorithms that admit different classes of invariances.
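As a rough numerical illustration of the noise-free single-feature case, the sketch below estimates the average squared loss of the minimum-norm least-squares interpolant, which is the solution reached by gradient descent on squared loss from a zero initialization and is one example of a rotation-invariant learner. The Gaussian input distribution, the function name `min_norm_excess_loss`, and all parameter values are assumptions made for this illustration and are not taken from the paper.

```python
import numpy as np


def min_norm_excess_loss(d, k, trials=200, seed=0):
    """Average squared distance between the min-norm interpolant and a
    one-sparse target e_1, using k Gaussian examples in d dimensions.

    For isotropic inputs this distance equals the expected squared
    prediction error on a fresh example.
    """
    rng = np.random.default_rng(seed)
    target = np.zeros(d)
    target[0] = 1.0  # the single relevant feature (assumed to be the first)
    losses = []
    for _ in range(trials):
        X = rng.standard_normal((k, d))  # assumed input distribution
        y = X @ target                   # noise-free labels
        w = np.linalg.pinv(X) @ y        # minimum-norm least-squares solution
        losses.append(np.sum((w - target) ** 2))
    return float(np.mean(losses))


d = 64
for k in (8, 16, 32, 64):
    estimate = min_norm_excess_loss(d, k)
    print(f"k={k:3d}  empirical loss={estimate:.3f}  1-k/d={1 - k / d:.3f}")
```

Under these assumptions the empirical loss should track $1-k/d$ and drop to roughly zero once $k$ reaches $d$, which is the regime where the noise-free lower bound stops being informative.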