We consider supervised learning with $n$ labels and show that layerwise SGD on residual networks can efficiently learn a class of hierarchical models. This model class assumes the existence of an (unknown) label hierarchy $L_1 \subseteq L_2 \subseteq \dots \subseteq L_r = [n]$, where labels in $L_1$ are simple functions of the input, while for $i > 1$, labels in $L_i$ are simple functions of simpler labels. Our class goes beyond the models previously shown to be learnable by deep learning algorithms, in the sense that it reaches the depth limit of efficient learnability. That is, there are models in this class that require polynomial depth to express, whereas previously considered models can be computed by log-depth circuits. Furthermore, we suggest that the learnability of such hierarchical models might eventually form a basis for understanding deep learning. Beyond their natural fit for domains where deep learning excels, we argue that the mere existence of human ``teachers'' supports the hypothesis that hierarchical structures are naturally available. By providing granular labels, teachers effectively reveal ``hints'' or ``snippets'' of the internal algorithms used by the brain. We formalize this intuition, showing that in a simplified model where a teacher is partially aware of their own internal logic, a hierarchical structure emerges that facilitates efficient learnability.
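To make the definition concrete, the following is a minimal, hypothetical Python sketch of the hierarchical label class, not the paper's construction or learning algorithm: level-1 labels are linear-threshold functions of the input, and each higher level adds labels that are two-input parities of previously defined labels. The input dimension, level sizes, and the choice of ``simple functions'' are illustrative assumptions only; the sketch shows how the full label vector over $[n]$ is generated level by level.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

d = 16               # input dimension (illustrative choice)
sizes = [4, 4, 4]    # labels added at levels 1, 2, 3 (so n = 12)

def make_level1(num_labels, dim):
    """Level-1 labels: simple (here, linear-threshold) functions of the input."""
    W = rng.standard_normal((num_labels, dim))
    return lambda x: (W @ x > 0).astype(int)

def make_higher_level(num_labels, num_prev):
    """Level-i labels (i > 1): simple (here, 2-input parity) functions
    of labels computed at earlier levels."""
    pairs = rng.integers(0, num_prev, size=(num_labels, 2))
    return lambda prev: np.array([prev[i] ^ prev[j] for i, j in pairs])

# Build one labeling function per level.
level_fns = [make_level1(sizes[0], d)]
num_so_far = sizes[0]
for m in sizes[1:]:
    level_fns.append(make_higher_level(m, num_so_far))
    num_so_far += m

def all_labels(x):
    """Compute the full label vector level by level: each new level
    depends only on the input through previously computed labels."""
    labels = level_fns[0](x)
    for f in level_fns[1:]:
        labels = np.concatenate([labels, f(labels)])
    return labels

x = rng.standard_normal(d)
print(all_labels(x))   # 12 binary labels, built hierarchically
\end{verbatim}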