The energy and latency costs of deep neural network inference are increasingly driven by deployment rather than training, motivating hardware-specialized alternatives to arithmetic-heavy models. Field-Programmable Gate Arrays (FPGAs) provide an attractive substrate for such specialization, yet existing FPGA-based neural approaches are fragmented and difficult to compare. We present BitLogic, a fully gradient-based, end-to-end trainable framework for FPGA-native neural networks built around Lookup Table (LUT) computation. BitLogic replaces multiply-accumulate operations with differentiable LUT nodes that map directly to FPGA primitives, enabling native binary computation, sparse connectivity, and efficient hardware realization. The framework offers a modular functional API supporting diverse architectures, along with learned encoders, hardware-aware heads, and multiple boundary-consistent LUT relaxations. An automated Register Transfer Level (RTL) export pipeline translates trained PyTorch models into synthesizable HDL, ensuring equivalence between software and hardware inference. Experiments across standard vision benchmarks and heterogeneous hardware platforms demonstrate competitive accuracy and substantial gains in FPGA efficiency, including 72.3% test accuracy on CIFAR-10 achieved with fewer than 0.3M logic gates, while attaining sub-20 ns single-sample inference using only LUT resources.
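To make the central idea concrete, the sketch below shows one common way a differentiable, boundary-consistent LUT node can be expressed in PyTorch: the multilinear relaxation, where a k-input lookup table is evaluated as a weighted sum over all 2^k binary assignments, with learnable table logits. This is an illustrative assumption, not BitLogic's actual implementation; the function name `soft_lut` and the specific relaxation are hypothetical.

```python
import torch

def soft_lut(x, table_logits):
    """Differentiable k-input LUT via the multilinear relaxation (a sketch,
    not the paper's implementation).

    x: tensor of shape (..., k) with soft bits in [0, 1]
    table_logits: tensor of shape (2**k,); sigmoid gives soft table entries

    At binary corners x in {0, 1}^k this reduces to an exact table lookup,
    so the relaxation is boundary-consistent and the trained node maps
    directly onto an FPGA LUT primitive.
    """
    k = x.shape[-1]
    table = torch.sigmoid(table_logits)  # soft truth-table entries in (0, 1)
    out = torch.zeros_like(x[..., 0])
    for idx in range(2 ** k):
        # probability that the soft inputs equal this binary assignment
        prob = torch.ones_like(x[..., 0])
        for i in range(k):
            bit = (idx >> i) & 1
            prob = prob * (x[..., i] if bit else 1.0 - x[..., i])
        out = out + table[idx] * prob
    return out
```

Because every term is a product of the inputs and a sigmoid of the logits, gradients flow to `table_logits` through standard autograd; after training, thresholding the table entries yields the bit pattern programmed into the hardware LUT.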