In this manuscript, we study the problem of scalar-on-distribution regression; that is, settings in which subject-specific distributions or densities (or, in practice, repeated measures from those distributions) are the covariates related to a scalar outcome via a regression model. We propose a direct regression for such distribution-valued covariates that circumvents estimating subject-specific densities and uses the observed repeated measures directly as covariates. The model is invariant to any transformation or ordering of the repeated measures. Endowing the regression function with a Gaussian Process prior, we obtain closed-form or conjugate Bayesian inference. Our method subsumes standard Bayesian non-parametric regression using Gaussian Processes as a special case. Theoretically, we show that the method can achieve an optimal estimation error bound. To our knowledge, this is the first theoretical study of Bayesian regression using distribution-valued covariates. Through simulation studies and an analysis of an activity count dataset, we demonstrate that our method performs better than approaches that require an intermediate density estimation step.
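To make the idea of regressing directly on repeated measures concrete, the following is a minimal sketch and not the construction used in this manuscript. It assumes one common way to build an ordering-invariant kernel on sets of measurements: averaging a base RBF kernel over all pairs of measurements from two subjects. It then applies the standard conjugate Gaussian Process posterior mean. All function names, the lengthscale, the noise variance, and the toy data are hypothetical choices made for illustration only.

```python
import numpy as np


def rbf(a, b, lengthscale=1.0):
    """Base RBF kernel between two 1-D arrays of scalar measurements."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)


def set_kernel(samples_i, samples_j, lengthscale=1.0):
    """Average the base kernel over all cross pairs of measurements.

    Assumption for this sketch: the average does not depend on the order
    of the measurements within either set.
    """
    return rbf(samples_i, samples_j, lengthscale).mean()


def gram(sets_a, sets_b, lengthscale=1.0):
    """Gram matrix between two lists of measurement sets."""
    return np.array(
        [[set_kernel(a, b, lengthscale) for b in sets_b] for a in sets_a]
    )


def gp_posterior_mean(train_sets, y, test_sets, noise_var=0.1, lengthscale=1.0):
    """Conjugate GP posterior mean under Gaussian noise with variance noise_var."""
    K = gram(train_sets, train_sets, lengthscale)
    K_star = gram(test_sets, train_sets, lengthscale)
    alpha = np.linalg.solve(K + noise_var * np.eye(len(train_sets)), y)
    return K_star @ alpha


# Toy data (hypothetical): subject i contributes a varying number of
# repeated measures drawn from N(mu_i, 1); the outcome depends on mu_i.
rng = np.random.default_rng(0)
mus = rng.uniform(-2, 2, size=30)
train_sets = [rng.normal(mu, 1.0, size=rng.integers(20, 40)) for mu in mus]
y = np.sin(mus) + 0.05 * rng.normal(size=30)

test_set = rng.normal(0.5, 1.0, size=25)
pred = gp_posterior_mean(train_sets, y, [test_set])

# Reordering the repeated measures leaves the prediction unchanged
# (up to floating-point rounding).
pred_shuffled = gp_posterior_mean(train_sets, y, [rng.permutation(test_set)])
```

In this sketch, no subject-level density is estimated at any point; the raw measurements enter the kernel directly. When every subject contributes a single measurement, `set_kernel` reduces to the base RBF kernel, so the sketch collapses to ordinary GP regression on scalar covariates, which mirrors the special-case relationship described above.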