We consider the problem of selecting a small subset of representative variables from a large dataset. In the computer science literature, this dimensionality reduction problem is typically formalized as Column Subset Selection (CSS). Meanwhile, the typical statistical formalization is to find an information-maximizing set of Principal Variables. This paper shows that these two approaches are equivalent, and moreover, both can be viewed as maximum likelihood estimation within a certain semi-parametric model. Using these connections, we show how to efficiently (1) perform CSS using only summary statistics from the original dataset; (2) perform CSS in the presence of missing and/or censored data; and (3) select the subset size for CSS in a hypothesis testing framework.
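To make point (1) concrete, the following is a minimal sketch of how a CSS-style selection can run on a covariance matrix alone, without access to the raw data. It assumes the usual CSS objective (the squared reconstruction error left after regressing all columns on the selected ones) and uses a simple greedy forward search. The greedy heuristic, the function name `greedy_css_from_cov`, and the `tol` parameter are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def greedy_css_from_cov(sigma, k, tol=1e-12):
    """Greedily choose up to k column indices using only a covariance matrix.

    Hypothetical helper for illustration. `sigma` is a (p, p) symmetric
    positive semi-definite matrix. Returns the selected indices and the
    trace of the residual covariance, which is the variance left
    unexplained by the selected columns.
    """
    resid = np.array(sigma, dtype=float, copy=True)
    p = resid.shape[0]
    chosen = np.zeros(p, dtype=bool)
    selected = []
    for _ in range(k):
        diag = np.diag(resid).copy()
        # Only columns with remaining variance that are not yet chosen qualify.
        valid = (diag > tol) & ~chosen
        if not valid.any():
            break
        # Variance explained by adding column j: sum_i resid[i, j]^2 / resid[j, j].
        gain = np.full(p, -np.inf)
        gain[valid] = (resid[:, valid] ** 2).sum(axis=0) / diag[valid]
        j = int(np.argmax(gain))
        selected.append(j)
        chosen[j] = True
        # Partial out column j from the residual covariance (Schur complement step).
        resid -= np.outer(resid[:, j], resid[j, :]) / resid[j, j]
    return selected, float(np.trace(resid))


# Usage: three uncorrelated variables with variances 4, 1, and 9.
selected, unexplained = greedy_css_from_cov(np.diag([4.0, 1.0, 9.0]), k=2)
```

Because the search only touches `sigma`, the same routine applies whether the covariance was computed from raw data or supplied as a summary statistic. A greedy search is not guaranteed to find the optimal subset.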