We develop an inferential toolkit, within the conformal framework, for analyzing object-valued responses, that is, data situated in general metric spaces, paired with Euclidean predictors. To this end we introduce conditional profile average transport costs. A distance profile is the one-dimensional distribution of the probability mass falling into balls of increasing radius around a given point, and we compare two distance profiles through the optimal transport cost of moving one to the other. The average cost of transporting a given distance profile to all others is crucial for statistical inference in metric spaces and underpins the proposed conditional profile scores. A key feature of the proposed approach is to use the distribution of conditional profile average transport costs as the conformity score for general metric space-valued responses, which facilitates the construction of prediction sets by the split conformal algorithm. We derive the uniform convergence rate of the proposed conformity score estimators and establish asymptotic conditional validity for the prediction sets. The finite-sample performance on synthetic data in various metric spaces shows that the proposed conditional profile score outperforms existing methods in terms of both coverage level and size of the resulting prediction sets, even in the special case of scalar and thus Euclidean responses. We also demonstrate the practical utility of conditional profile scores for network data from New York taxi trips and for compositional data reflecting energy sourcing of U.S. states.
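The pipeline described in the abstract (distance profiles, average transport cost, split conformal prediction sets) can be illustrated with a deliberately simplified sketch. It is not the authors' estimator: it ignores the predictors entirely (so the profiles are unconditional), it assumes the transport cost is the squared 2-Wasserstein distance between one-dimensional empirical distributions, and it uses the raw average cost as a nonconformity score rather than the distribution of conditional costs that the abstract refers to. All function names and the toy scalar data are hypothetical.

```python
import numpy as np

def profile_quantiles(dist_to_train):
    """Sort each row of distances: the sorted row is the empirical
    quantile function of that point's distance profile."""
    return np.sort(dist_to_train, axis=1)

def avg_transport_cost(query_q, train_q):
    """Average transport cost from each query profile to all training profiles.

    Assumption: cost = squared 2-Wasserstein distance. For 1-D empirical
    measures with equally many atoms this is the mean squared difference
    of the sorted atoms.
    """
    diff = query_q[:, None, :] - train_q[None, :, :]
    return (diff ** 2).mean(axis=2).mean(axis=1)

def split_conformal_threshold(cal_scores, alpha):
    """Standard split conformal cutoff: the ceil((n+1)(1-alpha))-th
    smallest calibration score, or +inf if that index exceeds n."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return np.inf
    return np.sort(cal_scores)[k - 1]

# Toy data: scalar responses with the absolute-value metric. Any metric
# space works as long as pairwise distances can be computed.
rng = np.random.default_rng(0)
y = rng.normal(size=200)
train, cal = y[:100], y[100:]
dist = lambda a, b: np.abs(a[:, None] - b[None, :])

# Profiles of training points, measured against the training sample.
train_q = profile_quantiles(dist(train, train))

# Nonconformity scores on the calibration half, then the cutoff.
cal_scores = avg_transport_cost(profile_quantiles(dist(cal, train)), train_q)
tau = split_conformal_threshold(cal_scores, alpha=0.1)

# Prediction set: candidate responses whose score does not exceed the cutoff.
grid = np.linspace(-4, 4, 201)
grid_scores = avg_transport_cost(profile_quantiles(dist(grid, train)), train_q)
prediction_set = grid[grid_scores <= tau]
```

Because the sketch only needs a matrix of pairwise distances, swapping `dist` for another metric (for example a distance between networks or compositions) leaves the rest unchanged. A conditional version would additionally need an estimate of the distance profiles given the predictor value, which is not attempted here.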