Conventional statistical and machine learning methods typically assume that the test data follow the same distribution as the training data. This assumption does not always hold, especially in applications where the target population is not well represented in the training data. The issue is notable in health-related studies, where specific ethnic populations may be underrepresented, which poses a significant challenge for researchers aiming to make statistical inferences about these minority groups. In this work, we present a novel approach to this challenge for linear regression models. We organize the model parameters of all sub-populations into a tensor. By studying a structured tensor completion problem, we achieve robust domain generalization, i.e., we learn about sub-populations with limited or no available data. Our method makes novel use of the structure of the group labels, and it can produce more reliable and interpretable generalization results. We establish rigorous theoretical guarantees for the proposed method and demonstrate its minimax optimality. To validate the effectiveness of our approach, we conduct extensive numerical experiments and a real data study on education level prediction for multiple ethnic groups, and we compare our results with those of existing methods.
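To make the tensor view concrete, the following is a minimal sketch on simulated data, not the estimator proposed in this work. It assumes groups are indexed by two hypothetical labels `a` and `b`, and that each group's coefficient vector has the additive form `beta[a, b] = u[a] + v[b]`. That structure, all variable names, and the data are assumptions chosen purely for illustration. Under these assumptions, the coefficients of a group with no training data can be filled in from the groups that were observed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_b, p, n = 3, 4, 5, 200  # levels of each label, features, samples per group

# Assumed structure (illustrative only): beta[a, b] = u[a] + v[b].
u = rng.normal(size=(n_a, p))
v = rng.normal(size=(n_b, p))
beta_true = u[:, None, :] + v[None, :, :]  # parameter tensor, shape (n_a, n_b, p)

missing = (2, 3)  # the sub-population with no training data

# Step 1: ordinary least squares within each observed group.
beta_hat = np.full((n_a, n_b, p), np.nan)
for a in range(n_a):
    for b in range(n_b):
        if (a, b) == missing:
            continue
        X = rng.normal(size=(n, p))
        y = X @ beta_true[a, b] + 0.1 * rng.normal(size=n)
        beta_hat[a, b] = np.linalg.lstsq(X, y, rcond=None)[0]

# Step 2: fit the additive structure to the observed cells by least squares.
observed = [(a, b) for a in range(n_a) for b in range(n_b) if (a, b) != missing]
D = np.zeros((len(observed), n_a + n_b))
for i, (a, b) in enumerate(observed):
    D[i, a] = 1.0        # indicator for label a
    D[i, n_a + b] = 1.0  # indicator for label b
T = np.stack([beta_hat[a, b] for a, b in observed])  # shape (n_observed, p)
coef = np.linalg.lstsq(D, T, rcond=None)[0]          # minimum-norm solution
u_hat, v_hat = coef[:n_a], coef[n_a:]

# Step 3: complete the missing cell of the tensor.
beta_missing = u_hat[missing[0]] + v_hat[missing[1]]
err = np.linalg.norm(beta_missing - beta_true[missing])
print(f"estimation error for the unobserved group: {err:.4f}")
```

The individual `u_hat` and `v_hat` are not identifiable on their own, since a constant can be shifted between them, but the sum used for the missing cell is identifiable here because the unobserved group shares each of its labels with at least one observed group.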