As an intrinsic and fundamental property of big data, data heterogeneity arises in a variety of real-world applications, such as precision medicine, autonomous driving, and finance. For machine learning algorithms, ignoring data heterogeneity can severely hurt generalization performance and algorithmic fairness, since the prediction mechanisms of different sub-populations are likely to differ from one another. In this work, we focus on the data heterogeneity that affects the predictions of machine learning models and propose, for the first time, the \emph{usable predictive heterogeneity}, which takes into account model capacity and computational constraints. We prove that it can be reliably estimated from finite data with probably approximately correct (PAC) bounds. Additionally, we design a bi-level optimization algorithm to explore the usable predictive heterogeneity from data. Empirically, the explored heterogeneity provides insights into sub-population divisions in income prediction, crop yield prediction, and image classification tasks, and leveraging such heterogeneity benefits out-of-distribution generalization performance.