Feature-distributed data, that is, data partitioned by features and stored across multiple computing nodes, are increasingly common in applications with a large number of features. This paper proposes a two-stage relaxed greedy algorithm (TSRGA) for applying multivariate linear regression to such data. The main advantage of TSRGA is that its communication complexity does not depend on the feature dimension, making it highly scalable to very large data sets. In addition, for multivariate response variables, TSRGA can be used to yield low-rank coefficient estimates. The fast convergence of TSRGA is validated by simulation experiments. Finally, we apply TSRGA in a financial application that leverages unstructured data from 10-K reports, demonstrating its usefulness in applications with many dense large-dimensional matrices.
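To make the feature-distributed setting concrete, the following is a minimal sketch, not the TSRGA algorithm itself. It assumes a single process that simulates the nodes, and the helper names (`split_features`, `distributed_prediction`) and the problem sizes are hypothetical. It illustrates only the elementary identity that the fitted values are a sum of per-node contributions, so each node can report an n-by-d matrix whose size does not grow with the number of features p.

```python
import numpy as np

def split_features(X, num_nodes):
    """Partition the columns of X into num_nodes contiguous blocks,
    one block per (simulated) computing node."""
    return np.array_split(X, num_nodes, axis=1)

def distributed_prediction(blocks, coef_blocks):
    """Each node computes X_j @ B_j locally; the results are summed.
    Every message has shape (n, d), independent of the feature count."""
    messages = [Xj @ Bj for Xj, Bj in zip(blocks, coef_blocks)]
    return sum(messages), messages

# Hypothetical sizes chosen for illustration only.
rng = np.random.default_rng(0)
n, p, d, num_nodes = 50, 1000, 3, 4
X = rng.standard_normal((n, p))   # n observations, p features
B = rng.standard_normal((p, d))   # coefficients for d responses

blocks = split_features(X, num_nodes)

# Split the coefficient rows so that they line up with the column blocks.
sizes = [blk.shape[1] for blk in blocks]
coef_blocks = np.split(B, np.cumsum(sizes)[:-1], axis=0)

Y_hat, messages = distributed_prediction(blocks, coef_blocks)
```

Here `Y_hat` agrees with the centralized product `X @ B`, while each entry of `messages` stays at shape `(n, d)` however large `p` becomes. How TSRGA selects and updates coefficients, and how many such exchanges it needs, is specified in the paper and is not reproduced here.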