Learning from Biased and Costly Data Sources: Minimax-optimal Data Collection under a Budget

Data collection is a critical component of modern statistical and machine learning pipelines, particularly when data must be gathered from multiple heterogeneous sources to study a target population of interest. In many use cases, such as medical studies or political polling, different sources incur different sampling costs. Observations often have associated group identities (for example, health markers, demographics, or political affiliations), and the relative composition of these groups may differ substantially, both among the source populations and between the sources and the target population. In this work, we study multi-source data collection under a fixed budget, focusing on the estimation of population means and group-conditional means. We show that naive data collection strategies (e.g., attempting to "match" the target distribution) or reliance on standard estimators (e.g., the sample mean) can be highly suboptimal. Instead, we develop a sampling plan that maximizes the effective sample size, defined as the total sample size divided by $D_{\chi^2}(q \parallel \overline{p}) + 1$, where $q$ is the target distribution, $\overline{p}$ is the aggregated source distribution, and $D_{\chi^2}$ is the $\chi^2$-divergence. We pair this sampling plan with a classical post-stratification estimator and upper-bound its risk. We provide matching lower bounds, establishing that our approach achieves the budgeted minimax optimal risk. Our techniques also extend to prediction problems in which the goal is to minimize the excess risk, providing a principled approach to multi-source learning with costly and heterogeneous data sources.
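
To make the effective sample size concrete, the following is a minimal sketch, not the paper's algorithm. It assumes discrete groups, the standard form $D_{\chi^2}(q \parallel p) = \sum_g q_g^2 / p_g - 1$, and that $\overline{p}$ is the mixture of source group distributions weighted by the number of samples drawn from each source (the abstract does not spell out the aggregation). All function names, the two-source setup, the costs, and the budget are hypothetical.

```python
def chi2_divergence(q, p):
    """Chi-square divergence D(q || p) = sum_g q_g^2 / p_g - 1 (discrete case)."""
    return sum(qg * qg / pg for qg, pg in zip(q, p)) - 1.0


def aggregated_distribution(counts, sources):
    """Assumed aggregation: mixture of the source group distributions,
    weighted by how many samples are drawn from each source."""
    n = sum(counts)
    n_groups = len(sources[0])
    return [
        sum(c * src[g] for c, src in zip(counts, sources)) / n
        for g in range(n_groups)
    ]


def effective_sample_size(counts, sources, q):
    """Total sample size divided by D_chi2(q || p_bar) + 1."""
    p_bar = aggregated_distribution(counts, sources)
    return sum(counts) / (chi2_divergence(q, p_bar) + 1.0)


def post_stratified_mean(samples, q):
    """Classical post-stratification: weight each group's sample mean by its
    target proportion. Assumes every group appears at least once in samples."""
    sums = [0.0] * len(q)
    cnts = [0] * len(q)
    for group, value in samples:
        sums[group] += value
        cnts[group] += 1
    return sum(q[g] * sums[g] / cnts[g] for g in range(len(q)))


# Hypothetical setup: two groups, two sources, budget of 100 cost units.
q = [0.5, 0.5]                        # target group proportions
sources = [[0.5, 0.5], [0.8, 0.2]]    # group proportions in sources A and B
costs = [2.0, 1.0]                    # cost per sample from A and B
budget = 100.0

# Plan 1: "match" the target by spending everything on source A.
match_plan = [int(budget / costs[0]), 0]
# Plan 2: spend everything on the cheaper but biased source B.
cheap_plan = [0, int(budget / costs[1])]

ess_match = effective_sample_size(match_plan, sources, q)  # about 50
ess_cheap = effective_sample_size(cheap_plan, sources, q)  # about 64
```

In this toy setup the plan that matches the target distribution yields an effective sample size of about 50, while the cheaper, biased source yields about 64, which illustrates why matching the target can be suboptimal under a budget. The comparison covers only two extreme allocations; it does not search for the allocation that maximizes the effective sample size.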
