Fast Evaluation of Truncated Neumann Series by Low-Product Radix Kernels

Truncated Neumann series $S_k(A)=I+A+\cdots+A^{k-1}$ are used in approximate matrix inversion and polynomial preconditioning. In dense settings, matrix-matrix products dominate the cost of evaluating $S_k$. Naive evaluation needs $k-1$ products, while splitting methods reduce this to $O(\log k)$. Repeated squaring, for example, uses $2\log_2 k$ products, so further gains require higher-radix kernels that extend the series length by a factor of $m$ per update. Beyond the known radix-5 kernel, explicit higher-radix constructions were not available, and the existence of exact rational kernels was unclear. We construct radix kernels for $T_m(B)=I+B+\cdots+B^{m-1}$ and use them to build faster series algorithms. For radix 9, we derive an exact 3-product kernel with rational coefficients, which is the first exact construction beyond radix 5. This kernel yields $5\log_9 k\approx 1.58\log_2 k$ products, a 21% reduction from repeated squaring. For radix 15, numerical optimization yields a 4-product kernel that matches the target through degree 14 but has nonzero spillover (extra terms) at degrees $\ge 15$. Because spillover breaks the standard telescoping update, we introduce a residual-based radix-kernel framework that accommodates approximate kernels and retains the cost coefficient $(\mu_m+2)/\log_2 m$, where $\mu_m$ is the number of products in the radix-$m$ kernel (3 for radix 9, 4 for radix 15). Within this framework, radix 15 attains $6/\log_2 15\approx 1.54$, the best known asymptotic rate. Numerical experiments support the predicted product-count savings and associated runtime trends.
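The product counts quoted in the abstract can be checked with a small sketch. The code below is illustrative only and is not the authors' implementation: it covers the naive and repeated-squaring (radix-2) baselines, not the radix-9 or radix-15 kernels, whose coefficients the abstract does not give. The function names `neumann_naive`, `neumann_doubling`, and `rate` are hypothetical, and `rate` simply evaluates the stated coefficient $(\mu_m+2)/\log_2 m$, assuming $\mu_m$ is the kernel's product count.

```python
import math
import numpy as np


def neumann_naive(A, k):
    """S_k(A) = I + A + ... + A^(k-1) via Horner's rule: k-1 products."""
    I = np.eye(A.shape[0])
    S = I.copy()                  # S_1 = I
    products = 0
    for _ in range(k - 1):
        S = I + A @ S             # S_{j+1} = I + A S_j
        products += 1
    return S, products


def neumann_doubling(A, levels):
    """S_k(A) for k = 2**levels via S_{2j} = S_j (I + A^j).

    Each level is counted as two products (one update, one squaring),
    matching the 2*log2(k) figure for repeated squaring.
    """
    I = np.eye(A.shape[0])
    S = I.copy()                  # S_1 = I
    P = A.copy()                  # P holds A^j, starting at j = 1
    products = 0
    for _ in range(levels):
        S = S @ (I + P)           # doubles the number of series terms
        P = P @ P                 # A^j -> A^(2j)
        products += 2
    return S, products


def rate(mu, m):
    """Coefficient (mu + 2) / log2(m) of log2(k) in the product count."""
    return (mu + 2) / math.log2(m)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((6, 6))
    A *= 0.5 / np.linalg.norm(A, 2)          # spectral norm 0.5
    S_slow, n_slow = neumann_naive(A, 16)
    S_fast, n_fast = neumann_doubling(A, 4)
    print(np.allclose(S_slow, S_fast), n_slow, n_fast)
    print(rate(0, 2), rate(3, 9), rate(4, 15))
```

With $\mu_2=0$ (the radix-2 kernel $T_2(B)=I+B$ needs no product) `rate` reproduces the coefficient 2 of repeated squaring, and the radix-9 and radix-15 values round to 1.58 and 1.54, consistent with the figures above.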
