Wasserstein Regression with Empirical Measures and Density Estimation for Sparse Data

Modeling the relationship between univariate distributions and one or more explanatory variables has attracted increasing interest. Traditional functional data methods cannot be applied directly to distributional data because of the constraints inherent to such data. Modeling distributions as elements of the Wasserstein space, a geodesic metric space equipped with the Wasserstein metric from optimal transport, is attractive for statistical applications. Existing approaches substitute estimated proxy distributions for the typically unknown response distributions. These estimates are obtained from the available data but are problematic when only a few observations are available for some of the distributions. Such situations are common in practice and cannot be addressed with existing approaches, especially when the target is a density estimate. We show how this and other problems associated with density estimation, such as tuning parameter selection and bias, can be sidestepped when covariates are available. We also introduce a novel version of distribution-response regression that is based on empirical measures. By avoiding the preprocessing step of recovering complete individual response distributions, the proposed approach remains applicable when the sample size available for some of the distributions is small. In this case, one can still obtain consistent distribution estimates, even for distributions with only a few observations, by borrowing strength across the entire sample of distributions. Traditional approaches that estimate distributions or densities individually fail here, since sparsely sampled densities cannot be estimated consistently. Simulations demonstrate that the proposed model outperforms existing approaches. Its efficacy is corroborated in two case studies, one on Environmental Influences on Child Health Outcomes (ECHO) data and one on eBay auction data.
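
As background for working with empirical measures directly: for univariate distributions, the 2-Wasserstein distance has a closed form in terms of quantile functions, W_2(mu, nu)^2 = ∫_0^1 (F_mu^{-1}(u) - F_nu^{-1}(u))^2 du, and the quantile function of an empirical measure is defined for any sample size, however small. The following is a minimal sketch of that computation only, not the regression method proposed here. The function names `empirical_quantile` and `wasserstein2_1d` and the grid size `n_grid` are illustrative assumptions.

```python
import numpy as np


def empirical_quantile(sample, u):
    """Left-continuous quantile function of the empirical measure of `sample`.

    For u in (0, 1) this returns the order statistic x_(ceil(n * u)).
    """
    x = np.sort(np.asarray(sample, dtype=float))
    n = x.size
    idx = np.ceil(n * np.asarray(u, dtype=float)).astype(int) - 1
    return x[np.clip(idx, 0, n - 1)]


def wasserstein2_1d(sample_a, sample_b, n_grid=1000):
    """Approximate 2-Wasserstein distance between two univariate empirical measures.

    The integral over (0, 1) is approximated by a midpoint rule on `n_grid`
    equally spaced probability levels (an illustrative choice).
    """
    u = (np.arange(n_grid) + 0.5) / n_grid
    qa = empirical_quantile(sample_a, u)
    qb = empirical_quantile(sample_b, u)
    return float(np.sqrt(np.mean((qa - qb) ** 2)))


# A sparsely sampled distribution (3 observations) compared with a
# densely sampled one; no density estimate is needed for either.
rng = np.random.default_rng(0)
sparse = rng.normal(loc=0.0, scale=1.0, size=3)
dense = rng.normal(loc=0.0, scale=1.0, size=500)
print(wasserstein2_1d(sparse, dense))
```

The two samples may have different sizes, since both quantile functions are evaluated on the same grid of probability levels. No bandwidth or other smoothing parameter is involved.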
