We present a GPU-based system for automatic differentiation (AD) of functions defined on triangle meshes, designed to exploit the locality and sparsity of mesh-based computation. Our system evaluates derivatives using per-element forward-mode AD, confining all computation to registers and shared memory, and assembles global gradients, sparse Jacobians, and sparse Hessians directly on the GPU. By avoiding global computation graphs, intermediate buffers, and device-host synchronization, our approach minimizes memory traffic and enables efficient differentiation under both static and dynamically changing sparsity. Our programming model lets users express energy terms over mesh neighborhoods, while the system automatically manages parallel execution, derivative propagation, sparse assembly, and matrix-free operations such as Hessian-vector products. The system supports both scalar- and vector-valued objectives, dynamic interaction-driven sparsity updates, and seamless integration with external GPU sparse linear solvers. We evaluate our system on applications including elastic and cloth simulation, surface parameterization, mesh smoothing, frame field design, ARAP deformation, and spherical manifold optimization. Across these tasks, our system consistently outperforms state-of-the-art differentiation frameworks, including PyTorch, JAX, Warp, Dr.JIT, and Thallo. We demonstrate speedups across a range of solver types, from Newton and Gauss-Newton for nonlinear least squares to L-BFGS and gradient descent, and across different derivative usage modes, including Hessian-vector products as well as full sparse Hessian and Jacobian construction.