The empirical success of deep learning is often attributed to deep networks' ability to exploit hierarchical structure in data, constructing increasingly complex features across layers. Yet despite substantial progress in deep learning theory, most optimization results still focus on networks with only two or three layers, leaving the theoretical understanding of hierarchical learning in genuinely deep models limited. This raises a natural question: can we prove that deep networks, trained by gradient-based methods, efficiently exploit hierarchical structure? In this work, we consider Random Hierarchy Models, a class of hierarchical context-free grammars introduced in arXiv:2307.02129 and conjectured to separate deep from shallow networks. We prove that, under mild conditions, a deep convolutional network can be efficiently trained to learn this function class. Our proof builds on a general observation: if each intermediate layer receives a clean signal from the labels and the relevant features are weakly identifiable, then training the network one layer at a time suffices to learn the target function hierarchically.
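To make the data model concrete, the sketch below samples from a Random Hierarchy Model as described in arXiv:2307.02129: a fixed random grammar expands each symbol into a tuple of lower-level symbols, level by level, down to the input string. The parameter names and the simplifying choices (class count equal to the vocabulary size, branching factor s = 2) are our assumptions for illustration, not the paper's exact setup.

```python
# Minimal sketch of sampling from a Random Hierarchy Model (assumptions noted above).
import random

v, s, m, L = 8, 2, 2, 3   # vocabulary size, branching factor, rules per symbol, depth

# Fixed random grammar: at every level, each of the v symbols gets m distinct
# s-tuples of lower-level symbols, and no tuple is shared between symbols,
# so every string has a unique parse.
random.seed(0)
rules = []
for _ in range(L):
    all_tuples = [(a, b) for a in range(v) for b in range(v)]  # s = 2 hardcoded here
    chosen = random.sample(all_tuples, v * m)
    rules.append([chosen[i * m:(i + 1) * m] for i in range(v)])

def sample(label):
    """Expand a class label (0..v-1) down to its s**L leaf symbols."""
    seq = [label]
    for level in range(L):
        seq = [sym for parent in seq for sym in random.choice(rules[level][parent])]
    return seq

x = sample(random.randrange(v))   # e.g. a length-8 input string for depth L = 3
```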
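The closing observation suggests a greedy layerwise procedure, sketched roughly below in PyTorch on placeholder data: each convolutional stage is trained in turn against the labels through a disposable linear probe, then frozen. The architecture, probe design, toy data, and hyperparameters are illustrative assumptions, not the construction analyzed in the paper.

```python
# Rough sketch of greedy layerwise training with linear probes (assumptions noted above).
import torch
import torch.nn as nn

s, v, n_classes, depth = 2, 8, 8, 3   # branching factor, vocab size, classes, levels
seq_len = s ** depth                  # number of leaf symbols in the hierarchy

# One convolutional stage per level: stride-s convolutions merge s siblings at a time.
stages = nn.ModuleList([
    nn.Sequential(nn.Conv1d(v if l == 0 else 64, 64, kernel_size=s, stride=s), nn.ReLU())
    for l in range(depth)
])

# Toy stand-in for RHM samples: random one-hot sequences with random labels.
x = nn.functional.one_hot(torch.randint(v, (256, seq_len)), v).float().transpose(1, 2)
y = torch.randint(n_classes, (256,))

for l, stage in enumerate(stages):
    # Push the data through the already-trained (frozen) stages.
    with torch.no_grad():
        h = x
        for trained in stages[:l]:
            h = trained(h)
    # Train this stage against the labels through a disposable linear probe.
    probe = nn.Linear(64 * (seq_len // s ** (l + 1)), n_classes)
    opt = torch.optim.Adam(list(stage.parameters()) + list(probe.parameters()), lr=1e-2)
    for step in range(200):
        loss = nn.functional.cross_entropy(probe(stage(h).flatten(1)), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"stage {l}: final probe loss {loss.item():.3f}")
```

Each probe is discarded after its stage is trained; only the convolutional stages are kept, so the final network is trained one layer at a time rather than end to end.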