As machine learning models are increasingly deployed in dynamic environments, it becomes paramount to assess and quantify the uncertainties associated with distribution shifts. A distribution shift occurs when the underlying data-generating process changes, leading to a deviation in the model's performance. The prediction interval, which captures the range of likely outcomes for a given prediction, is a crucial tool for characterizing the uncertainty induced by the underlying distribution. In this paper, we propose methodologies for aggregating prediction intervals to obtain one with minimal width and adequate coverage on the target domain under unsupervised domain shift, a setting in which we have labeled samples from a related source domain and unlabeled covariates from the target domain. Our analysis covers scenarios in which the source and target domains are related via (i) a bounded density ratio, or (ii) a measure-preserving transformation. The proposed methodologies are computationally efficient and easy to implement. In addition to illustrating the performance of our method on real-world datasets, we establish rigorous theoretical guarantees, with finite-sample bounds, on the coverage and width of our prediction intervals. Our approach thus pairs strong empirical performance with a solid theoretical foundation.