Iterative differential approximation methods that rely on backpropagation have enabled the optimization of neural networks; however, they remain computationally expensive, especially when training models at scale. In this paper, we propose a computationally efficient alternative for optimizing neural networks that can both reduce the cost of scaling neural networks and provide high-efficiency optimization for low-resource applications. We derive an explicit solution to a simple feed-forward language model (LM) by mathematically analyzing its gradients. This solution generalizes from single-layer LMs to the class of all single-layer feed-forward softmax-activated neural models trained on positive-valued features, as we demonstrate by extending it to MNIST digit classification. For both LMs and digit classifiers, we find computationally that explicit solutions perform near-optimally, in experiments showing that 1) iterative optimization only marginally improves the explicit solution parameters and 2) randomly initialized parameters iteratively optimize towards the explicit solution. We also preliminarily apply the explicit solution locally by layer in multi-layer networks and discuss how the solution's computational savings increase with model complexity. For both single- and multi-layer applications of the explicit solution, we emphasize that the optima achieved cannot be reached by backpropagation alone, i.e., better optima appear discoverable only after explicit solutions are applied. Finally, we discuss the solution's computational savings alongside its impact on model interpretability and suggest future directions for the derivation of explicit solutions to complex and multi-layer architectures.