Adaptive Voronoi-based Column Selection Methods for Interpretable Dimensionality Reduction

In data analysis, there remains a need for interpretable dimensionality reduction methods in which the intrinsic meaning of the data is retained in the reduced space. Standard approaches such as Principal Component Analysis (PCA) and the Singular Value Decomposition (SVD) fail at this task, since they represent the data through linear combinations of the original features rather than through the features themselves. A popular alternative is the CUR decomposition. In an SVD-like manner, the CUR decomposition approximates a matrix $A \in \mathbb{R}^{m \times n}$ as $A \approx CUR$, where $C$ and $R$ consist of columns and rows, respectively, selected from the original matrix \cite{goreinov1997theory, mahoney2009cur}. The difficulty in constructing a CUR decomposition lies in determining which columns and rows to select when forming $C$ and $R$. Current column/row selection algorithms, particularly those that rely on an SVD, become infeasible as the data grow large \cite{dong2021simpler}. We address this problem by reducing the column/row selection problem to a collection of smaller sub-problems. The basic idea is to first partition the rows/columns of a matrix, and then apply an existing selection algorithm to each piece; for illustration purposes we use the Discrete Empirical Interpolation Method (\textsf{DEIM}) \cite{sorensen2016deim}. For the partitioning step, we consider two existing algorithms that construct a Voronoi Tessellation (VT) of the rows and columns of a given matrix. We then extend these methods to adapt automatically to the data. The result is four data-driven row/column selection methods that are well-suited for parallelization and compatible with nearly any existing column/row selection strategy. Theory and numerical examples show the resulting methods to be competitive with the original \textsf{DEIM} routine.
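
To make the partition-then-select idea concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes a plain nearest-generator assignment as a stand-in for the VT construction, which the abstract does not specify, and it applies a standard \textsf{DEIM} index selection to the leading right singular vectors of each cell. The function names `deim`, `voronoi_cells`, and `partitioned_deim_columns`, as well as the parameter `per_cell`, are hypothetical and introduced only for illustration.

```python
import numpy as np


def deim(V):
    """Standard DEIM index selection on the columns of V (n x k)."""
    n, k = V.shape
    idx = [int(np.argmax(np.abs(V[:, 0])))]
    for j in range(1, k):
        # Interpolate column j at the indices chosen so far,
        # using the previous j columns as the basis.
        c = np.linalg.solve(V[idx, :j], V[idx, j])
        r = V[:, j] - V[:, :j] @ c  # residual vanishes at chosen indices
        idx.append(int(np.argmax(np.abs(r))))
    return np.array(idx)


def voronoi_cells(A, generators):
    """Assign each column of A (m x n) to its nearest generator (m x p).

    Assumption: Euclidean nearest-generator assignment stands in for
    the VT algorithms referred to in the text.
    """
    diff = A[:, :, None] - generators[:, None, :]  # m x n x p
    labels = (diff ** 2).sum(axis=0).argmin(axis=1)  # length n
    return [np.flatnonzero(labels == c) for c in range(generators.shape[1])]


def partitioned_deim_columns(A, cells, per_cell):
    """Run DEIM independently on each cell and merge the global indices."""
    selected = []
    for cell in cells:
        if cell.size == 0:
            continue
        k = min(per_cell, cell.size, A.shape[0])
        # Small SVD of the sub-matrix only; cells are independent,
        # so this loop could be run in parallel.
        _, _, Vt = np.linalg.svd(A[:, cell], full_matrices=False)
        local = deim(Vt[:k].T)
        selected.extend(cell[local].tolist())
    return np.array(sorted(selected))
```

Given the selected indices `cols`, the factor $C$ would be `A[:, cols]`; applying the same routine to `A.T` gives row indices for $R$. Each cell requires only the SVD of its own sub-matrix, which is where the reduction in cost and the suitability for parallelization would come from under these assumptions.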
