Integrated principal components analysis, or iPCA, is an unsupervised learning technique for grouped vector data recently defined by Tang and Allen. Like PCA, iPCA computes new axes that best explain the variance of the data, but iPCA is designed to handle corrupting influences of the elements within each group on one another, e.g. data about students at a school grouped into classrooms. Tang and Allen showed empirically that regularized iPCA finds useful features for such grouped data in practice. However, it is not yet known when unregularized iPCA generically exists. By contrast, PCA (which is a special case of iPCA) typically exists whenever the number of data points exceeds the dimension. We study this question and find that the answer is significantly more complicated than it is for PCA. Despite this complexity, we find simple sufficient conditions for a very useful case: when the groups are no more than half as large as the dimension and the total number of data points exceeds the dimension, iPCA generically exists. We also fully characterize the existence of iPCA in the case where all the groups are the same size. When the groups are not all the same size, however, we find that the group sizes for which iPCA generically exists are the integral points in a non-convex union of polyhedral cones. Nonetheless, we exhibit a polynomial-time algorithm to decide whether iPCA generically exists (based on the affirmative answer to the saturation conjecture by Knutson and Tao), as well as a very simple randomized polynomial-time algorithm.
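As a rough illustration of the simple sufficient condition stated above, the following sketch checks whether a list of group sizes satisfies it. The function name and arguments are hypothetical, and the sketch assumes that "group size" means the number of data points in a group, so that the total number of data points is the sum of the group sizes. The condition is only sufficient: a `False` result does not mean that iPCA fails to exist generically, only that this particular criterion does not apply.

```python
from typing import Sequence


def sufficient_condition_holds(group_sizes: Sequence[int], dimension: int) -> bool:
    """Check the sufficient condition from the abstract (hypothetical helper).

    Assumption: each entry of group_sizes is the number of data points in
    one group, and dimension is the dimension of each data vector.
    """
    if dimension <= 0 or any(g <= 0 for g in group_sizes):
        raise ValueError("dimension and all group sizes must be positive")

    # Every group is no more than half as large as the dimension.
    # Written as 2 * g <= dimension to stay in integer arithmetic.
    groups_small = all(2 * g <= dimension for g in group_sizes)

    # The total number of data points exceeds the dimension.
    enough_points = sum(group_sizes) > dimension

    return groups_small and enough_points


if __name__ == "__main__":
    # Three groups of size 3 in dimension 6: both parts of the condition hold.
    print(sufficient_condition_holds([3, 3, 3], 6))
    # Two groups of size 4 in dimension 6: groups are too large for this criterion.
    print(sufficient_condition_holds([4, 4], 6))
```

Cases that fall outside this criterion would need the full characterization or the decision algorithms mentioned in the abstract, neither of which is reproduced here.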