We study distributional data under sparse sampling, where each unit is represented by a probability distribution on the real line observed only through a small i.i.d. sample. A natural notion of central tendency for one-dimensional distributional data is the Wasserstein barycenter, whose quantile function is the pointwise average of the unit-level quantile functions. We focus on pointwise estimation of the Wasserstein barycenter quantile function: at a given quantile level, the target is the population mean of the corresponding unit-level quantiles. A naive plug-in estimator is the empirical Wasserstein barycenter, which treats observed unit-level empirical distributions as the true latent unit-level distributions. Under sparse sampling, however, this estimator can be severely biased. We propose an approach that avoids directly estimating either the unit-level distributions or the full population law of distributions. We start with the more ambitious goal of characterizing the distribution of latent unit-level quantiles at a given quantile level. We show that this distribution can be written in terms of the marginal distributions of the unit-level CDF values, which can be estimated using binomial mixture methods. This motivates our estimator, the marginal-constructed barycenter (MCB) estimator, obtained by taking the mean of the estimated distribution of latent unit-level quantiles. We establish conditions under which the MCB estimator is pointwise consistent and asymptotically normal, and show through simulations that it can substantially outperform the empirical Wasserstein barycenter under sparse sampling. We illustrate the method in an analysis of HIV-1 sequence data from the HVTN 502/503 vaccine efficacy trials, using the barycenter to summarize and compare within-participant distributions of viral sequence features when only a small number of sequences are available per participant.
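To make the sparse-sampling bias of the naive plug-in estimator concrete, the following is a minimal Python sketch of the empirical Wasserstein barycenter quantile at a single level. It is an illustration under assumptions that are not taken from the paper: the Gaussian data-generating model, the left-continuous empirical quantile convention, and the function names `empirical_quantile`, `empirical_barycenter_quantile`, and `simulate` are all hypothetical. It does not implement the MCB estimator.

```python
import math
import random
from statistics import NormalDist, fmean


def empirical_quantile(sample, u):
    """Left-continuous inverse of the empirical CDF at level u in (0, 1]."""
    xs = sorted(sample)
    # 1-based index of the order statistic; the small tolerance guards
    # against floating-point overshoot when u * len(xs) is an integer.
    k = max(1, math.ceil(u * len(xs) - 1e-12))
    return xs[k - 1]


def empirical_barycenter_quantile(samples, u):
    """Naive plug-in estimate: average the unit-level empirical quantiles."""
    return fmean(empirical_quantile(s, u) for s in samples)


def simulate(n_units, m, u, seed=0):
    """Assumed toy model: unit i has law N(mu_i, 1) with mu_i ~ N(0, 1).

    Returns (plug-in estimate, average of the true unit-level quantiles).
    """
    rng = random.Random(seed)
    z_u = NormalDist().inv_cdf(u)  # standard normal quantile at level u
    mus = [rng.gauss(0.0, 1.0) for _ in range(n_units)]
    samples = [[rng.gauss(mu, 1.0) for _ in range(m)] for mu in mus]
    target = fmean(mu + z_u for mu in mus)  # mean of true unit-level quantiles
    estimate = empirical_barycenter_quantile(samples, u)
    return estimate, target


if __name__ == "__main__":
    for m in (3, 200):
        est, tgt = simulate(n_units=2000, m=m, u=0.9)
        print(f"m={m}: plug-in={est:.3f}, target={tgt:.3f}")
```

With three observations per unit, the level-0.9 empirical quantile under this convention is the sample maximum, whose expectation lies well below the true 0.9 quantile, so averaging over more units does not remove the gap. Only increasing the per-unit sample size does, which is the regime the abstract describes as unavailable under sparse sampling.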