Traditional low-rank approximation is a powerful tool for compressing the huge data matrices that arise in simulations of partial differential equations (PDEs), but it suffers from high computational cost and requires several passes over the PDE data. The compressed data may also lack interpretability, making it difficult to identify feature patterns in the original data. To address these issues, we present an online randomized algorithm that computes the interpolative decomposition (ID) of large-scale data matrices {\em in situ}. In contrast to previous randomized IDs, which use a QR decomposition to determine the column basis, we adopt a streaming ridge-leverage-score-based column subset selection algorithm that dynamically selects suitable basis columns from the data, thereby avoiding an extra pass over the data to compute the coefficient matrix of the ID. In particular, we employ a single-pass error estimator based on the non-adaptive Hutch++ algorithm to provide real-time error estimates for determining the best coefficients. As a result, our approach needs only a single pass over the original data and is therefore well suited to large, high-dimensional matrices stored out of core or generated on the fly in PDE simulations. We also present a strategy, within the ID framework, for improving the accuracy of the reconstructed data gradient when desired. Numerical experiments on turbulent channel flow and ignition simulations, and on the NSTX Gas Puff Image dataset, compare our algorithm with the offline ID algorithm and demonstrate its utility in real-world applications.
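To make the ingredients concrete, the following is an illustrative offline, in-memory sketch in NumPy: columns are scored by ridge leverage scores and the ID coefficients are fit by least squares, with a plain Hutchinson estimator standing in for the paper's single-pass non-adaptive Hutch++ error estimator. The streaming aspect is omitted, and the variable names, ridge-parameter choice, and deterministic top-k selection are our own assumptions, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic low-rank-plus-noise "snapshot" matrix standing in for PDE data.
m, n, r = 200, 100, 10
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
A += 0.01 * rng.standard_normal((m, n))

# Ridge leverage scores of the columns: tau_i = [A^T A (A^T A + lam*I)^{-1}]_{ii}.
# (Offline version; the paper's streaming variant maintains approximations of
# these scores as columns arrive.)
k = 15                                   # target number of basis columns
G = A.T @ A
eigs = np.linalg.eigvalsh(G)             # ascending eigenvalues of the Gram matrix
lam = (np.trace(G) - eigs[-k:].sum()) / k  # a common ridge-parameter choice (assumption)
tau = np.einsum("ij,ji->i", G, np.linalg.solve(G + lam * np.eye(n), np.eye(n)))

# Deterministic variant of score-based selection: keep the k highest-score columns.
idx = np.sort(np.argsort(tau)[-k:])
C = A[:, idx]                            # selected basis columns

# Coefficient matrix via least squares, giving the ID  A ~ C @ T.
T, *_ = np.linalg.lstsq(C, A, rcond=None)
rel_err = np.linalg.norm(A - C @ T) / np.linalg.norm(A)

# Stochastic estimate of the squared error tr(E^T E), E = A - C T, using a plain
# Hutchinson estimator (simplified stand-in for non-adaptive Hutch++): for a
# standard normal probe g,  E[||E g||^2] = ||E||_F^2.
probes = rng.standard_normal((n, 50))
Eg = A @ probes - C @ (T @ probes)
err2_est = np.mean(np.sum(Eg**2, axis=0))

print(f"relative Frobenius error: {rel_err:.2e}")
print(f"estimated squared error : {err2_est:.2e}")
```

In the single-pass setting described in the abstract, the Gram matrix, the selected columns, and the probe products would all be updated incrementally as the simulation produces new columns, so the data itself is never revisited.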