Feature-distributed data, that is, data partitioned by features and stored across multiple computing nodes, are increasingly common in applications with a large number of features. This paper proposes a two-stage relaxed greedy algorithm (TSRGA) for fitting multivariate linear regression models to such data. The main advantage of TSRGA is that its communication complexity does not depend on the feature dimension, making it highly scalable to very large data sets. In addition, for multivariate response variables, TSRGA can be used to yield low-rank coefficient estimates. The fast convergence of TSRGA is validated by simulation experiments. Finally, we apply the proposed TSRGA in a financial application that leverages unstructured data from 10-K reports, demonstrating its usefulness in applications with many dense large-dimensional matrices.
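To make the notion of feature-distributed data concrete, the following minimal sketch partitions the columns of a design matrix across several simulated nodes and lets each node compute its own share of the fitted values. It is not an implementation of TSRGA, whose steps are not described here. The names `split_by_features` and `local_contribution`, the problem sizes, and the assumption that each node holds the coefficient rows matching its features are all hypothetical choices made for illustration.

```python
import numpy as np


def split_by_features(M, num_nodes, axis):
    """Partition M into num_nodes contiguous blocks along the given axis."""
    return np.array_split(M, num_nodes, axis=axis)


def local_contribution(X_block, B_block):
    """One node's n-by-d share of the fitted values, from its own features only."""
    return X_block @ B_block


rng = np.random.default_rng(0)
n, p, d, num_nodes = 50, 1000, 3, 4  # illustrative sizes (assumed)

X = rng.standard_normal((n, p))  # n observations, p features
B = rng.standard_normal((p, d))  # coefficients for a d-dimensional response

# Each simulated node holds a block of columns of X and the matching rows of B.
X_blocks = split_by_features(X, num_nodes, axis=1)
B_blocks = split_by_features(B, num_nodes, axis=0)

# Each node produces an n-by-d array; its size does not grow with p.
messages = [local_contribution(Xb, Bb) for Xb, Bb in zip(X_blocks, B_blocks)]

# Summing the per-node shares recovers the full fitted values X @ B.
fitted = sum(messages)
```

In this toy setup every per-node array has shape `(n, d)` regardless of how many features the node stores, which gives some intuition for why a method's communication cost can be independent of the feature dimension.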