Explicit Foundation Model Optimization with Self-Attentive Feed-Forward Neural Units

Iterative approximation methods using backpropagation enable the optimization of neural networks, but they remain computationally expensive, especially when used at scale. This paper presents an efficient alternative for optimizing neural networks that reduces the cost of scaling them and provides high-efficiency optimization for low-resource applications. We first discuss a general result about feed-forward neural networks and then extend this solution to compositional (multi-layer) networks, which are applied to a simplified transformer block containing feed-forward and self-attention layers. These models are used to train highly specified and complex multi-layer neural architectures that we refer to as self-attentive feed-forward unit (SAFFU) layers, which we use to develop a transformer that appears to generalize well over small, cognitively feasible volumes of data. Testing demonstrates that explicit solutions outperform models optimized by backpropagation alone. Moreover, further application of backpropagation after explicit solutions leads to better optima from smaller scales of data; that is, explicit-solution warm starts enable training effective models from much less data. We then carry out ablation experiments, training a roadmap of about 250 transformer models over 1 million tokens, to determine ideal settings. We find that multiple different architectural variants produce highly performant models, and discover from this ablation that some of the best are not the most parameterized. This appears to indicate that well-generalized models could be reached with less data by using explicit solutions, and that architectural exploration using explicit solutions pays dividends in guiding the search for efficient variants with fewer parameters, which could be incorporated into low-resource hardware where AI might be embodied.
