Traditional low-rank approximation is a powerful tool for compressing the huge data matrices that arise in simulations of partial differential equations (PDEs), but it is computationally expensive and requires several passes over the PDE data. The compressed data may also lack interpretability, making it difficult to identify feature patterns from the original data. To address these issues, we present an online randomized algorithm that computes the interpolative decomposition (ID) of large-scale data matrices in situ. Unlike previous randomized IDs, which use the QR decomposition to determine the column basis, we adopt a streaming ridge leverage score-based column subset selection algorithm that dynamically selects suitable basis columns from the data and thus avoids an extra pass over the data to compute the coefficient matrix of the ID. In particular, we adopt a single-pass error estimator based on the non-adaptive Hutch++ algorithm to provide real-time error approximation for determining the best coefficients. As a result, our approach needs only a single pass over the original data and is therefore suitable for large, high-dimensional matrices stored outside of core memory or generated in PDE simulations. We also report numerical experiments on turbulent channel flow and ignition simulations, and on the NSTX Gas Puff Image dataset, comparing our algorithm with the offline ID algorithm to demonstrate its utility in real-world applications.
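To make the structure of an ID concrete, the following is a minimal offline sketch, not the streaming single-pass algorithm described above. It approximates a matrix A by A[:, J] P, where J indexes k actual columns of A and P is a coefficient matrix. The function names `ridge_leverage_scores` and `interpolative_decomposition` are hypothetical. Computing the scores from a full SVD, picking the k highest-scoring columns deterministically (rather than sampling them), and solving for P by least squares are all simplifying assumptions made for illustration.

```python
import numpy as np


def ridge_leverage_scores(A, k):
    """Rank-k ridge leverage scores of the columns of A.

    Offline illustration only: uses a full SVD, which a streaming
    method would avoid.
    """
    _, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Ridge parameter: energy in the tail beyond rank k, divided by k.
    # The floor guards against 0/0 when A has exact rank <= k.
    lam = max(np.sum(s[k:] ** 2) / k, np.finfo(float).tiny)
    # tau_j = a_j^T (A A^T + lam I)^{-1} a_j
    #       = sum_i s_i^2 / (s_i^2 + lam) * Vt[i, j]^2
    weights = s**2 / (s**2 + lam)
    return weights @ (Vt**2)


def interpolative_decomposition(A, k):
    """Return (J, P) such that A is approximately A[:, J] @ P."""
    scores = ridge_leverage_scores(A, k)
    # Assumption: deterministic top-k selection instead of random sampling.
    J = np.sort(np.argsort(scores)[-k:])
    C = A[:, J]  # basis columns taken directly from the data
    # Coefficients minimizing ||A - C P||_F, computed by least squares.
    P, *_ = np.linalg.lstsq(C, A, rcond=None)
    return J, P
```

Because the basis consists of actual data columns, each column of A is expressed as a combination of k observed snapshots, which is the source of the interpretability mentioned above. Note that this sketch reads A twice (once for the scores, once for the coefficients); avoiding that second read is precisely what the streaming approach is designed for.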