NeuronFabric: A Software Reference Architecture for On-Chip Transformer Training with Local Adam

Publicly documented accelerator architectures generally either separate training computation from optimizer-state updates or rely on external memory and host orchestration. This paper presents NeuronFabric, a software reference architecture intended for future FPGA and ASIC implementations of transformer training with local Adam updates. A complete C# prototype implements the forward pass, backpropagation, and Adam optimization without external machine-learning frameworks; its purpose is to validate numerical correctness and memory requirements before hardware implementation. The paper introduces BF16W, a configuration that stores weights in BF16 while retaining the Adam optimizer moments in FP32, which reduces the memory required for on-chip training. The evaluated model is a 334K-parameter autoregressive transformer (d=88, H=4, f=264, L=4, vocab=256) trained on the Shakespeare corpus. After 80K samples, the BF16W configuration reaches an evaluation loss of 1.5426, compared with 1.5224 for an FP32 GPU reference, and produces coherent character-level text. With FP32 weights and FP32 Adam moments, the 334K-parameter model requires approximately 4.0 MB, which matches the BRAM capacity of a Xilinx ZCU102 device. The BF16W variant requires approximately 3.34 MB, leaving memory available for activation storage. We describe the vocabulary-budget constraint observed during earlier experiments, quantify the BF16W memory savings, and outline FPGA training as the next stage of development. No FPGA measurements are included in this paper. This publication serves as a public architectural disclosure and software reference implementation for future FPGA and ASIC exploration of the NeuronFabric architecture.
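
The memory figures quoted above can be reproduced with a few lines of arithmetic. The following is a minimal sketch and not part of the C# prototype; it assumes a rounded count of exactly 334,000 parameters, decimal megabytes (1 MB = 10^6 bytes), and that only one copy of the weights plus the two Adam moments is counted. The function name `training_state_bytes` is hypothetical.

```python
# Memory budget sketch: one weight copy plus Adam first and second moments.
# Assumptions (illustrative, not taken from the prototype):
#   - exactly 334_000 parameters (the rounded "334K" figure)
#   - 1 MB = 10**6 bytes
#   - activations and gradients are not counted

BYTES_FP32 = 4
BYTES_BF16 = 2


def training_state_bytes(num_params, weight_bytes, moment_bytes=BYTES_FP32):
    """Return bytes needed for the weights and both Adam moments."""
    return num_params * (weight_bytes + 2 * moment_bytes)


NUM_PARAMS = 334_000

# FP32 baseline: 4 bytes (weight) + 4 + 4 bytes (moments) = 12 bytes per parameter.
fp32_bytes = training_state_bytes(NUM_PARAMS, BYTES_FP32)

# BF16W: 2 bytes (weight) + 4 + 4 bytes (moments) = 10 bytes per parameter.
bf16w_bytes = training_state_bytes(NUM_PARAMS, BYTES_BF16)

saved_bytes = fp32_bytes - bf16w_bytes

print(f"FP32  state: {fp32_bytes / 1e6:.2f} MB")
print(f"BF16W state: {bf16w_bytes / 1e6:.2f} MB")
print(f"Saved:       {saved_bytes / 1e6:.2f} MB")
```

Under these assumptions the saving is 2 bytes out of 12 per parameter, or one sixth of the FP32 training state, because the FP32 moments dominate the budget in both configurations.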
