Hierarchical Transformer Preconditioning for Interactive Physics Simulation

Neural preconditioners for real-time physics simulation offer promising data-driven priors, but they often fail to capture long-range couplings efficiently because they inherit local message passing or sparse-operator access patterns. We introduce the Hierarchical Transformer Preconditioner, a neural preconditioner anchored to a weak-admissibility H-matrix partition. The partition provides a multiscale structural prior (dense diagonal leaves plus coarsening off-diagonal tiles) that enables full-graph approximate-inverse computation with O(N) scaling at fixed block sizes. The network models the inverse through low-rank far-field factors and uses highway connections (axial buffers plus a global summary token) to propagate context across transformer depth. At each PCG iteration, preconditioner application reduces to batched dense GEMMs with regular memory access. The key training contribution is a cosine-Hutchinson probe objective that learns the action of MA (the learned approximate inverse M applied to the system matrix A) on convergence-critical spectral subspaces, optimizing the angular alignment of MAz with the probe vector z rather than forcing eigenvalue clusters to a prescribed location. This removes unnecessary spectral-placement constraints from SAI-style objectives and improves conditioning on irregular spectra. Because both network inference and preconditioner application are dense, dependency-free tensor programs, the full solve loop is captured as a single CUDA Graph. On stiff multiphase Poisson systems (up to 100:1 density contrast, N = 1,024 to 16,384), the solver runs at ~143 to ~21 fps. At N = 8,192, it reaches 17.9 ms/frame, a 2.2x speedup over GPU Jacobi, ~28x over GPU IC/DILU (AMGX multicolor_dilu), and 2.7x over neural SPAI retrained per scale on the same benchmark.
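To make the cosine probe idea concrete, here is a minimal sketch of one plausible reading of the objective, not the authors' implementation. It assumes M is an explicit approximate inverse, draws Rademacher probes z as in Hutchinson-style estimators, and averages 1 - cos(MAz, z). The function names, the dense list-of-lists matrices, and the toy diagonal system are illustrative assumptions. The sketch also leaves out the restriction to convergence-critical spectral subspaces that the abstract mentions.

```python
import math
import random


def matvec(mat, vec):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(a * x for a, x in zip(row, vec)) for row in mat]


def cosine_probe_loss(M, A, num_probes=32, seed=0):
    """Mean of 1 - cos(angle between M A z and z) over random probes z.

    Assumption: probes are Rademacher vectors (entries +1 or -1), the usual
    choice for Hutchinson-style estimators. The loss is zero whenever
    M A = c * I for any c > 0, so it does not pin eigenvalues to a location.
    """
    rng = random.Random(seed)
    n = len(A)
    total = 0.0
    for _ in range(num_probes):
        z = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        y = matvec(M, matvec(A, z))  # y = M A z
        dot = sum(yi * zi for yi, zi in zip(y, z))
        norm_y = math.sqrt(sum(yi * yi for yi in y))
        norm_z = math.sqrt(float(n))  # every Rademacher probe has norm sqrt(n)
        total += 1.0 - dot / (norm_y * norm_z + 1e-30)
    return total / num_probes


# Toy system with a 100:1 coefficient contrast (hypothetical, for illustration).
A = [[1.0, 0.0, 0.0],
     [0.0, 10.0, 0.0],
     [0.0, 0.0, 100.0]]
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]
# Exact inverse of A scaled by 3: M A = 3 * I, so the eigenvalues sit at 3, not 1.
scaled_inverse = [[3.0, 0.0, 0.0],
                  [0.0, 0.3, 0.0],
                  [0.0, 0.0, 0.03]]

loss_identity = cosine_probe_loss(identity, A)
loss_scaled = cosine_probe_loss(scaled_inverse, A)
```

In this toy case the scaled inverse drives the loss to (numerically) zero even though MA = 3I rather than I, whereas a Frobenius-style penalty on I - MA would still be nonzero. That scale invariance is the sense in which an angular objective avoids prescribing where the eigenvalues cluster.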
