In context-specific applications such as robotics, telecommunications, and healthcare, artificial intelligence systems often face the challenge of limited training data. This scarcity introduces epistemic uncertainty, i.e., reducible uncertainty stemming from incomplete knowledge of the underlying data distribution, which fundamentally limits predictive performance. This review paper examines formal methodologies that address data-limited regimes through two complementary approaches: quantifying epistemic uncertainty and mitigating data scarcity via synthetic data augmentation. We begin by reviewing generalized Bayesian learning frameworks that characterize epistemic uncertainty through generalized posteriors in the model parameter space, as well as ``post-Bayes'' learning frameworks. We continue by presenting information-theoretic generalization bounds that formalize the relationship between training data quantity and predictive uncertainty, providing a theoretical justification for generalized Bayesian learning. Moving beyond methods with asymptotic statistical validity, we survey uncertainty quantification methods that provide finite-sample statistical guarantees, including conformal prediction and conformal risk control. Finally, we examine recent advances that improve data efficiency by combining limited labeled data with abundant model predictions or synthetic data. Throughout, we take an information-theoretic perspective, highlighting the role of information measures in quantifying the impact of data scarcity.
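To make the finite-sample guarantee mentioned above concrete, the following is a minimal, illustrative sketch of split conformal prediction for regression with absolute-residual nonconformity scores. It is not the specific procedure of any method surveyed here; the function name `split_conformal_interval` and the toy model are our own illustrative choices. Under exchangeability of calibration and test points, the returned interval covers the true label with probability at least $1-\alpha$ marginally.

```python
import numpy as np

def split_conformal_interval(cal_preds, cal_labels, test_pred, alpha=0.1):
    """Split conformal prediction with absolute-residual scores.

    cal_preds/cal_labels: model predictions and true labels on a held-out
    calibration set; test_pred: the model's prediction for a new input.
    Returns an interval with marginal coverage >= 1 - alpha under
    exchangeability of calibration and test data.
    """
    scores = np.abs(cal_labels - cal_preds)  # nonconformity scores
    n = len(scores)
    # Finite-sample-corrected quantile level: ceil((n+1)(1-alpha))/n.
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    qhat = np.quantile(scores, q_level, method="higher")
    return test_pred - qhat, test_pred + qhat

# Toy usage: labels follow y = 2x plus small noise; the "model" predicts 2x.
rng = np.random.default_rng(0)
x_cal = rng.uniform(0.0, 1.0, 200)
y_cal = 2.0 * x_cal + rng.normal(0.0, 0.1, 200)
lo, hi = split_conformal_interval(2.0 * x_cal, y_cal, test_pred=2.0 * 0.5, alpha=0.1)
```

Note that the guarantee is distribution-free: it relies only on exchangeability, not on the model being well specified, which is what distinguishes such methods from approaches with merely asymptotic validity.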