A growing number of researchers use nth-order gradient computations in a wide variety of applications, including graphics, meta-learning (e.g., MAML), scientific computing, and, most recently, implicit neural representations (INRs). Recent work shows that the gradient of an INR can be used to edit the data it represents directly, without converting it back to a discrete representation. However, given a function represented as a computation graph, traditional architectures struggle to compute its nth-order gradient efficiently because of the higher demand for computing power and the more complex data movement. This makes nth-order gradient computation a promising target for FPGA acceleration. In this work, we introduce INR-Arch, a framework that transforms the computation graph of an nth-order gradient into a hardware-optimized dataflow architecture. We address this problem in two phases. First, we design a dataflow architecture that uses FIFO streams and an optimized computation kernel library, ensuring high memory efficiency and parallel computation. Second, we propose a compiler that extracts and optimizes computation graphs, automatically configures hardware parameters such as latency and stream depths to optimize throughput while ensuring deadlock-free operation, and outputs High-Level Synthesis (HLS) code for FPGA implementation. We use INR editing as our benchmark and present results that demonstrate 1.8-4.8x and 1.5-3.6x speedups over CPU and GPU baselines, respectively. Furthermore, relative to the same CPU and GPU baselines, we obtain 3.1-8.9x and 1.7-4.3x lower memory usage and 1.7-11.3x and 5.5-32.8x lower energy-delay product, respectively. Our framework will be made open-source and available on GitHub.
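To illustrate why stream depths matter for deadlock-free operation in a FIFO-based dataflow design, the following toy simulation may help. It is only a sketch under stated assumptions, not INR-Arch's compiler algorithm: the graph shape (a source that forks into a bypass stream and a block-processing kernel, followed by a join), the names `Fifo`, `simulate`, and `min_skip_depth`, and the brute-force depth search are all hypothetical and chosen for illustration.

```python
from collections import deque


class Fifo:
    """Bounded FIFO stream with a fixed depth (hypothetical helper)."""

    def __init__(self, depth):
        self.depth = depth
        self.q = deque()

    def full(self):
        return len(self.q) >= self.depth

    def empty(self):
        return not self.q


def simulate(n_tokens, window, depth_skip, depth_other=2):
    """Run a fork/join dataflow graph; return True if it finishes, False on deadlock.

    Assumes n_tokens is a multiple of window. The block kernel must read
    `window` inputs before it can emit any output.
    """
    skip = Fifo(depth_skip)      # source -> join (bypass path)
    to_k = Fifo(depth_other)     # source -> block kernel
    from_k = Fifo(depth_other)   # block kernel -> join
    sent = received = 0
    buffered = []                # inputs held inside the kernel
    pending = deque()            # kernel outputs waiting to be emitted

    while received < n_tokens:
        progress = False
        # Source: the fork writes the same token to both streams or stalls.
        if sent < n_tokens and not skip.full() and not to_k.full():
            skip.q.append(sent)
            to_k.q.append(sent)
            sent += 1
            progress = True
        # Kernel: drain pending outputs first, otherwise gather inputs.
        if pending:
            if not from_k.full():
                from_k.q.append(pending.popleft())
                progress = True
        elif not to_k.empty():
            buffered.append(to_k.q.popleft())
            progress = True
            if len(buffered) == window:
                pending.extend(2 * x for x in buffered)
                buffered.clear()
        # Join: needs one token from each input stream.
        if not skip.empty() and not from_k.empty():
            skip.q.popleft()
            from_k.q.popleft()
            received += 1
            progress = True
        if not progress:
            return False         # every process is stalled: deadlock
    return True


def min_skip_depth(n_tokens, window):
    """Brute-force the smallest bypass depth that avoids deadlock."""
    depth = 1
    while not simulate(n_tokens, window, depth):
        depth += 1
    return depth
```

In this toy model the bypass stream must hold at least as many tokens as the kernel consumes before producing its first output; a shallower stream stalls the source before the kernel can emit anything, and the join then never fires. A real compiler would presumably derive such depths from the kernels' latencies rather than by simulation, but that is an assumption here.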