In context-specific applications such as robotics, telecommunications, and healthcare, artificial intelligence systems often face the challenge of limited training data. This scarcity introduces epistemic uncertainty, i.e., reducible uncertainty stemming from incomplete knowledge of the underlying data distribution, which fundamentally limits predictive performance. This review paper examines formal methodologies that address data-limited regimes through two complementary approaches: quantifying epistemic uncertainty and mitigating data scarcity via synthetic data augmentation. We begin by reviewing generalized Bayesian learning frameworks that characterize epistemic uncertainty through generalized posteriors in the model parameter space, as well as ``post-Bayes'' learning frameworks. We continue by presenting information-theoretic generalization bounds that formalize the relationship between training data quantity and predictive uncertainty, providing a theoretical justification for generalized Bayesian learning. Moving beyond methods with asymptotic statistical validity, we survey uncertainty quantification methods that provide finite-sample statistical guarantees, including conformal prediction and conformal risk control. Finally, we examine recent advances in data efficiency that combine limited labeled data with abundant model predictions or synthetic data. Throughout, we take an information-theoretic perspective, highlighting the role of information measures in quantifying the impact of data scarcity.
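To make the finite-sample guarantee mentioned above concrete, the following is a minimal sketch of split conformal prediction for regression, on hypothetical synthetic data with an assumed pre-trained point predictor. The calibration set yields a single quantile of absolute residuals, and the resulting intervals cover the true label with probability at least $1-\alpha$ marginally, for any finite calibration-set size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D regression data: a calibration set (held out from
# training) and a test set drawn from the same distribution.
n_cal, n_test = 200, 100
x_cal = rng.uniform(-1, 1, n_cal)
y_cal = x_cal + 0.1 * rng.standard_normal(n_cal)
x_test = rng.uniform(-1, 1, n_test)
y_test = x_test + 0.1 * rng.standard_normal(n_test)

# Assumed pre-trained point predictor (here simply the identity map,
# standing in for any fitted model).
predict = lambda x: x

# Nonconformity scores on the calibration set: absolute residuals.
scores = np.abs(y_cal - predict(x_cal))

# Finite-sample-corrected quantile level: ceil((n+1)(1-alpha)) / n.
alpha = 0.1
level = np.ceil((n_cal + 1) * (1 - alpha)) / n_cal
q = np.quantile(scores, level, method="higher")

# Prediction intervals [f(x) - q, f(x) + q] satisfy
# P(y in interval) >= 1 - alpha marginally, under exchangeability.
covered = np.abs(y_test - predict(x_test)) <= q
print(covered.mean())  # empirical coverage, typically near 1 - alpha
```

The key design point is that no distributional assumption beyond exchangeability of calibration and test points is needed; the guarantee holds regardless of how poor the underlying predictor is, with interval width reflecting its accuracy.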