As machine learning models are increasingly deployed in dynamic environments, it becomes essential to assess and quantify the uncertainty associated with distribution shift. A distribution shift occurs when the underlying data-generating process changes, which can degrade the model's performance. A prediction interval, which captures the range of likely outcomes for a given prediction, is a key tool for characterizing the uncertainty induced by the underlying distribution. In this paper, we propose methodologies for aggregating prediction intervals to obtain one with minimal width and adequate coverage on the target domain under unsupervised domain shift, a setting in which we have labeled samples from a related source domain and unlabeled covariates from the target domain. Our analysis covers scenarios where the source and target domains are related via (i) a bounded density ratio and (ii) a measure-preserving transformation. The proposed methodologies are computationally efficient and easy to implement. Beyond illustrating the performance of our method on a real-world dataset, we establish rigorous theoretical guarantees, with finite-sample bounds, on the coverage and width of our prediction intervals. The approach thus combines strong practical performance with a theoretical framework that supports its reliability across diverse settings.
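To make the bounded-density-ratio setting concrete, the following is a minimal sketch and not the estimator proposed in the paper. It swaps in a simple importance-weighted selection rule: among a few candidate intervals, keep the one with the smallest average width on the target covariates whose density-ratio-weighted coverage on the labeled source sample is at least 1 - alpha. The names `weighted_coverage` and `select_interval`, the toy data, the known density ratio `2 * x`, and the centre predictor `m(x) = x` are all assumptions made for illustration.

```python
import numpy as np


def weighted_coverage(lower, upper, y, weights):
    """Importance-weighted fraction of labels that fall inside [lower, upper]."""
    covered = (y >= lower) & (y <= upper)
    return float(np.sum(weights * covered) / np.sum(weights))


def select_interval(candidates, x_src, y_src, x_tgt, density_ratio, alpha=0.1):
    """Return (index, mean target width) of the narrowest feasible candidate.

    candidates: list of functions mapping covariates to (lower, upper) arrays.
    density_ratio: hypothetical function giving p_target(x) / p_source(x).
    A candidate is feasible if its weighted source coverage is >= 1 - alpha.
    """
    w = density_ratio(x_src)
    best_idx, best_width = None, np.inf
    for idx, interval in enumerate(candidates):
        lo_s, hi_s = interval(x_src)
        if weighted_coverage(lo_s, hi_s, y_src, w) < 1 - alpha:
            continue  # not enough (reweighted) coverage
        lo_t, hi_t = interval(x_tgt)
        width = float(np.mean(hi_t - lo_t))  # width uses unlabeled target covariates
        if width < best_width:
            best_idx, best_width = idx, width
    return best_idx, best_width


# Toy data (assumed): source covariates are uniform on [0, 1]; target
# covariates have density 2x on [0, 1], so the density ratio 2x is bounded by 2.
rng = np.random.default_rng(0)
n = 5000
x_src = rng.uniform(size=n)
y_src = x_src + (0.5 + x_src) * rng.normal(size=n)  # noise grows with x
x_tgt = np.sqrt(rng.uniform(size=n))                # inverse-CDF draw from density 2x


def density_ratio(x):
    return 2.0 * x


# Candidate intervals: fixed half-widths around the assumed predictor m(x) = x.
half_widths = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0]
candidates = [lambda x, c=c: (x - c, x + c) for c in half_widths]

best_idx, best_width = select_interval(
    candidates, x_src, y_src, x_tgt, density_ratio, alpha=0.1
)
print("chosen half-width:", half_widths[best_idx], "mean target width:", best_width)
```

Because the noise in this toy example grows with the covariate and the target places more mass on large covariate values, the reweighted coverage check tends to demand a wider interval than an unweighted check on the source sample would. In practice the density ratio is not known and would have to be estimated, which this sketch does not attempt.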