Learning from Biased and Costly Data Sources: Minimax-optimal Data Collection under a Budget

Data collection is a critical component of modern statistical and machine learning pipelines, particularly when data must be gathered from multiple heterogeneous sources to study a target population of interest. In many applications, such as medical studies or political polling, different sources incur different sampling costs. Observations often have associated group identities (for example, health markers, demographics, or political affiliations), and the relative composition of these groups may differ substantially, both among the source populations and between the sources and the target population. In this work, we study multi-source data collection under a fixed budget, focusing on the estimation of population means and group-conditional means. We show that naive data collection strategies (e.g., attempting to "match" the target distribution) and standard estimators (e.g., the sample mean) can be highly suboptimal. Instead, we develop a sampling plan that maximizes the effective sample size: the total sample size divided by $D_{\chi^2}(q \,\|\, \overline{p}) + 1$, where $q$ is the target distribution, $\overline{p}$ is the aggregated source distribution, and $D_{\chi^2}$ is the $\chi^2$-divergence. We pair this sampling plan with a classical post-stratification estimator and upper bound its risk. We provide matching lower bounds, establishing that our approach achieves the budgeted minimax-optimal risk. Our techniques also extend to prediction problems in which the goal is to minimize the excess risk, providing a principled approach to multi-source learning with costly and heterogeneous data sources.
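To make the effective sample size concrete, the following is a minimal Python sketch under stated assumptions, not the paper's algorithm. It assumes a finite set of groups, assumes that $\overline{p}$ is the mixture of the source group distributions weighted by the number of samples drawn from each source, and uses the standard identity $D_{\chi^2}(q \,\|\, p) + 1 = \sum_g q_g^2 / p_g$. All function names are hypothetical, and the brute-force search over allocations is only a stand-in for the sampling plan, suitable for toy problem sizes.

```python
from itertools import product


def chi2_divergence(q, p):
    """Chi-squared divergence D(q || p) = sum_g q_g^2 / p_g - 1 over groups g."""
    # If the target puts mass on a group the sources never reach, the divergence is infinite.
    if any(qg > 0 and pg <= 0 for qg, pg in zip(q, p)):
        return float("inf")
    return sum(qg * qg / pg for qg, pg in zip(q, p) if qg > 0) - 1.0


def aggregated_source(counts, sources):
    """ASSUMPTION: p_bar is the mixture of source distributions weighted by sample counts."""
    n = sum(counts)
    num_groups = len(sources[0])
    return [sum(c * src[g] for c, src in zip(counts, sources)) / n
            for g in range(num_groups)]


def effective_sample_size(counts, sources, q):
    """Total sample size divided by D_chi2(q || p_bar) + 1."""
    n = sum(counts)
    p_bar = aggregated_source(counts, sources)
    return n / (chi2_divergence(q, p_bar) + 1.0)


def best_allocation(sources, costs, q, budget):
    """Hypothetical brute-force search over integer sample counts within the budget."""
    ranges = [range(int(budget // c) + 1) for c in costs]
    best, best_ess = None, 0.0
    for counts in product(*ranges):
        spend = sum(n * c for n, c in zip(counts, costs))
        if sum(counts) == 0 or spend > budget:
            continue
        ess = effective_sample_size(counts, sources, q)
        if ess > best_ess:
            best, best_ess = counts, ess
    return best, best_ess


def post_stratified_mean(samples, q):
    """Post-stratification: weight each group's sample mean by its target proportion q_g.

    `samples` is a list of (group_index, value) pairs pooled across all sources.
    """
    totals, sizes = {}, {}
    for g, y in samples:
        totals[g] = totals.get(g, 0.0) + y
        sizes[g] = sizes.get(g, 0) + 1
    estimate = 0.0
    for g, qg in enumerate(q):
        if qg == 0:
            continue
        if g not in sizes:
            raise ValueError(f"no samples observed for group {g}")
        estimate += qg * totals[g] / sizes[g]
    return estimate


if __name__ == "__main__":
    # Toy instance (made-up numbers): two sources, two groups, unequal per-sample costs.
    sources = [[0.9, 0.1], [0.2, 0.8]]
    costs = [1.0, 4.0]
    q = [0.5, 0.5]
    counts, ess = best_allocation(sources, costs, q, budget=40.0)
    print("allocation:", counts, "effective sample size:", ess)
```

In this sketch the allocation trades off raw sample count against mismatch with the target: drawing only from the cheap source maximizes the total sample size but inflates the divergence term, which is why simply buying the most samples need not maximize the effective sample size.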
