This paper develops a new framework, called modular regression, that uses auxiliary information -- such as variables other than the original features or additional data sets -- when training linear models. At a high level, our method follows a three-step routine: (i) decompose the regression task into several sub-tasks, (ii) fit the sub-task models, and (iii) use the sub-task models to provide an improved estimate for the original regression problem. This routine applies to widely used low-dimensional (generalized) linear models and to high-dimensional regularized linear regression. It also extends naturally to missing-data settings where only partial observations are available. By incorporating auxiliary information, our approach improves on the estimation efficiency and prediction accuracy of linear regression or the Lasso, under a conditional independence assumption for predicting the outcome. For high-dimensional settings, we develop an extension of our procedure that is robust to violations of the conditional independence assumption, in the sense that it improves efficiency if this assumption holds and coincides with the Lasso otherwise. We demonstrate the efficacy of our methods on simulated and real data sets.