Hierarchical Transformer Preconditioning for Interactive Physics Simulation

Neural preconditioners for real-time physics simulation offer promising data-driven priors, but they often fail to capture long-range couplings efficiently because they inherit local message passing or sparse-operator access patterns. We introduce the Hierarchical Transformer Preconditioner, a neural preconditioner anchored to a weak-admissibility H-matrix partition. The partition provides a multiscale structural prior (dense diagonal leaves plus coarsening off-diagonal tiles) that enables full-graph approximate-inverse computation with O(N) scaling at fixed block sizes. The network models the inverse through low-rank far-field factors and uses highway connections (axial buffers plus a global summary token) to propagate context across transformer depth. At each PCG iteration, applying the preconditioner reduces to batched dense GEMMs with regular memory access. The key training contribution is a cosine-Hutchinson probe objective that learns the action of MA on convergence-critical spectral subspaces, optimizing the angular alignment of MAz with z rather than forcing eigenvalue clusters to a prescribed location. This removes unnecessary spectral-placement constraints from SAI-style objectives and improves conditioning on irregular spectra. Because both network inference and preconditioner application are dense, dependency-free tensor programs, the full solve loop is captured as a single CUDA Graph. On stiff multiphase Poisson systems (up to 100:1 density contrast, N = 1,024 to 16,384), the solver's frame rate ranges from ~143 fps down to ~21 fps. At N = 8,192, it reaches 17.9 ms/frame, a 2.2x speedup over GPU Jacobi, ~28x over GPU IC/DILU (AMGX multicolor_dilu), and 2.7x over neural SPAI retrained per scale on the same benchmark.
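
To make the contrast between an angular probe objective and an SAI-style objective concrete, the following is a minimal NumPy sketch, not the training code of this work. It assumes plain Rademacher probes (the abstract does not specify how probes are drawn or how they target convergence-critical subspaces), a small 1D variable-coefficient Poisson matrix as a stand-in system, and the hypothetical helper names `cosine_probe_loss` and `sai_probe_loss`. It illustrates one property stated above: the cosine objective does not care where the spectrum of MA sits, so rescaling M leaves it unchanged, whereas a residual objective of the form ||MA - I|| penalizes the rescaling.

```python
import numpy as np

def make_poisson_1d(n, contrast=100.0):
    """1D variable-coefficient Poisson matrix (SPD), an illustrative stand-in system."""
    beta = np.where(np.arange(n + 1) < n // 2, 1.0, 1.0 / contrast)  # face coefficients
    A = np.zeros((n, n))
    i = np.arange(n)
    A[i, i] = beta[:-1] + beta[1:]
    A[i[:-1], i[:-1] + 1] = -beta[1:-1]
    A[i[:-1] + 1, i[:-1]] = -beta[1:-1]
    return A

def _probes(n, num_probes):
    # Assumption: Rademacher (+1/-1) probes, as in a standard Hutchinson estimator.
    rng = np.random.default_rng(0)
    return rng.choice([-1.0, 1.0], size=(n, num_probes))

def cosine_probe_loss(apply_M, A, num_probes=64):
    """Mean of 1 - cos(angle between M A z and z) over random probes z."""
    Z = _probes(A.shape[0], num_probes)
    W = apply_M(A @ Z)  # columns are M A z
    cos = np.sum(W * Z, axis=0) / (
        np.linalg.norm(W, axis=0) * np.linalg.norm(Z, axis=0) + 1e-30
    )
    return float(np.mean(1.0 - cos))

def sai_probe_loss(apply_M, A, num_probes=64):
    """Probe estimate of ||M A - I||_F^2 / n, an SAI-style residual objective."""
    Z = _probes(A.shape[0], num_probes)
    R = apply_M(A @ Z) - Z
    return float(np.mean(np.sum(R * R, axis=0)) / A.shape[0])

A = make_poisson_1d(32)
A_inv = np.linalg.inv(A)
exact = lambda X: A_inv @ X           # M = A^{-1}
scaled = lambda X: 5.0 * (A_inv @ X)  # M = 5 A^{-1}: same directions, shifted spectrum

print("cosine loss, exact :", cosine_probe_loss(exact, A))
print("cosine loss, scaled:", cosine_probe_loss(scaled, A))
print("SAI loss,    exact :", sai_probe_loss(exact, A))
print("SAI loss,    scaled:", sai_probe_loss(scaled, A))
```

In this sketch both cosine losses are numerically zero, while the SAI-style loss grows for the rescaled operator even though PCG convergence is unaffected by a positive scalar multiple of the preconditioner.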

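
The structure of the preconditioner application described in the abstract (dense diagonal leaves plus low-rank far-field factors, applied inside PCG as batched dense GEMMs) can be sketched as follows. This is a simplified, single-level illustration under stated assumptions, not the hierarchical H-matrix layout or the learned factors of this work: the leaves are hand-built inverses of the diagonal blocks of A, and the far field is one global low-rank term built from a piecewise-constant basis. The names `build_factors`, `apply_M`, and `pcg` are hypothetical.

```python
import numpy as np

def make_poisson_1d(n, contrast=100.0):
    """1D variable-coefficient Poisson matrix (SPD), an illustrative stand-in system."""
    beta = np.where(np.arange(n + 1) < n // 2, 1.0, 1.0 / contrast)  # face coefficients
    A = np.zeros((n, n))
    i = np.arange(n)
    A[i, i] = beta[:-1] + beta[1:]
    A[i[:-1], i[:-1] + 1] = -beta[1:-1]
    A[i[:-1] + 1, i[:-1]] = -beta[1:-1]
    return A

def build_factors(A, block):
    """Hand-built stand-ins for the factors a network would predict (assumption)."""
    nb = A.shape[0] // block
    # Dense diagonal leaves, stacked as (nb, block, block).
    leaves = np.stack([
        np.linalg.inv(A[k * block:(k + 1) * block, k * block:(k + 1) * block])
        for k in range(nb)
    ])
    # Low-rank far field: basis U of shape (n, nb) and small dense core C of shape (nb, nb).
    U = np.kron(np.eye(nb), np.ones((block, 1)))
    C = np.linalg.inv(U.T @ A @ U)
    return leaves, U, C

def apply_M(leaves, U, C, r):
    """z = M r using one batched GEMM for the leaves and three small dense products."""
    nb, b, _ = leaves.shape
    near = np.matmul(leaves, r.reshape(nb, b, 1)).reshape(-1)  # batched over leaves
    far = U @ (C @ (U.T @ r))
    return near + far

def pcg(A, b, apply_prec, tol=1e-8, max_iter=500):
    """Standard preconditioned conjugate gradients; returns (solution, iterations)."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = apply_prec(r)
    p = z.copy()
    rz = r @ z
    b_norm = np.linalg.norm(b)
    for k in range(1, max_iter + 1):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * b_norm:
            return x, k
        z = apply_prec(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter

n, block = 64, 8
A = make_poisson_1d(n)
b = np.random.default_rng(0).standard_normal(n)
leaves, U, C = build_factors(A, block)
x, iters = pcg(A, b, lambda r: apply_M(leaves, U, C, r))
print("PCG iterations:", iters)
print("relative residual:", np.linalg.norm(A @ x - b) / np.linalg.norm(b))
```

Every operation in `apply_M` is a dense product with a fixed shape and no data-dependent branching, which is the property the abstract relies on for regular memory access and for capturing the solve loop as a single CUDA Graph. The convergence check inside `pcg` is data-dependent in this sketch; how the captured loop handles termination is not stated in the abstract.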