Numerical modeling in electromagnetics and acoustics frequently entails matrix-vector multiplication with block Toeplitz structure. When the corresponding block Toeplitz matrix is not highly sparse, e.g. when considering the electromagnetic Green function in a spatial basis, such calculations are often carried out by performing a multilevel embedding that gives the matrix a fully circulant form. While this transformation allows the associated matrix-vector multiplication to be computed via fast Fourier transforms (FFTs) and diagonal multiplication, generally leading to dramatic performance improvements compared to naive multiplication, it also adds unnecessary information that increases memory consumption and reduces computational efficiency. As an improvement, we propose a lazy-embedding, eager-projection algorithm that, for dimensionality $d$, asymptotically reduces the number of needed computations by a factor $\propto d/ \left(2 - 2^{-d+1}\right)$ and peak memory usage by a factor $\propto 2/\left((d+1)2^{-d} + 1\right)$ in general, and by a factor $\propto\left(2^{d} + 1\right)/\left(d +2\right)$ for fully symmetric or skew-symmetric systems. The structure of the algorithm suggests several simple approaches for parallelizing large block Toeplitz matrix-vector products across multiple devices and adds flexibility in memory and task management.
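For orientation, the conventional circulant-embedding approach that the abstract takes as its baseline can be sketched in one dimension ($d = 1$) as follows. This is a minimal illustration in NumPy of the standard technique, not the lazy-embedding, eager-projection algorithm proposed here; the function names `toeplitz_matvec_fft` and `toeplitz_dense` are hypothetical and chosen only for this example.

```python
import numpy as np


def toeplitz_matvec_fft(first_col, first_row, x):
    """Multiply an n x n Toeplitz matrix by x via a 2n x 2n circulant embedding.

    first_col[k] is the entry on the k-th sub-diagonal and first_row[k] the
    entry on the k-th super-diagonal (first_row[0] is assumed equal to
    first_col[0]). Hypothetical helper for illustration only.
    """
    first_col = np.asarray(first_col)
    first_row = np.asarray(first_row)
    x = np.asarray(x)
    n = x.shape[0]

    # First column of the embedding circulant:
    # [c_0, ..., c_{n-1}, 0, r_{n-1}, ..., r_1]
    circ = np.concatenate([first_col, [0.0], first_row[:0:-1]])

    # Zero-pad the input vector to the embedded size 2n.
    x_pad = np.concatenate([x, np.zeros(n, dtype=x.dtype)])

    # A circulant matrix is diagonalized by the DFT, so the product becomes
    # an elementwise multiplication in Fourier space.
    y_pad = np.fft.ifft(np.fft.fft(circ) * np.fft.fft(x_pad))

    # Project back: only the first n entries belong to the Toeplitz product.
    return y_pad[:n]


def toeplitz_dense(first_col, first_row):
    """Build the dense Toeplitz matrix, used here only as a reference."""
    first_col = np.asarray(first_col)
    first_row = np.asarray(first_row)
    n = first_col.shape[0]
    diff = np.subtract.outer(np.arange(n), np.arange(n))  # i - j
    return np.where(diff >= 0, first_col[np.abs(diff)], first_row[np.abs(diff)])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 8
    col = rng.standard_normal(n)
    row = rng.standard_normal(n)
    row[0] = col[0]
    vec = rng.standard_normal(n)
    fast = toeplitz_matvec_fft(col, row, vec)
    slow = toeplitz_dense(col, row) @ vec
    print(np.allclose(fast, slow))
```

In this sketch the padded half of the embedded vector is zero on input and discarded on output, which is the "unnecessary information" the abstract refers to. In a $d$-level block Toeplitz problem the same doubling is applied along every dimension, so the embedded arrays are $2^{d}$ times larger than the original data.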