CAAL: Confidence-Aware Active Learning for Heteroscedastic Atmospheric Regression

Quantifying the impacts of air pollution on health and climate relies on key atmospheric particle properties such as toxicity and hygroscopicity. However, measuring these properties typically requires complex observational techniques or expensive particle-resolved numerical simulations, which limits the availability of labeled data. We therefore estimate these hard-to-measure particle properties from routinely available observations (e.g., air pollutant concentrations and meteorological conditions). Because routine observations only indirectly reflect particle composition and structure, the mapping from routine observations to particle properties is noisy, and the noise level depends on the input, yielding a heteroscedastic regression setting. With a limited and costly labeling budget, the central challenge is deciding which samples to measure or simulate. Active learning (AL) is a natural approach, but most acquisition strategies rely on total predictive uncertainty. Under heteroscedastic noise, this signal conflates reducible epistemic uncertainty with irreducible aleatoric uncertainty, so limited budgets are wasted in noise-dominated regions. To address this challenge, we propose a confidence-aware active learning framework (CAAL) for efficient and robust sample selection in heteroscedastic settings. CAAL consists of two components: a decoupled uncertainty-aware training objective that separately optimizes the predictive mean and noise level to stabilize uncertainty estimation, and a confidence-aware acquisition function that dynamically weights epistemic uncertainty using predicted aleatoric uncertainty as a reliability signal. Experiments on particle-resolved numerical simulations and real atmospheric observations show that CAAL consistently outperforms standard AL baselines. The proposed framework provides a practical and general solution for the efficient expansion of high-cost atmospheric particle property databases.
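To make the acquisition idea concrete, the following is a minimal sketch of a confidence-weighted score, not the authors' implementation. It assumes an ensemble of heteroscedastic regressors, where each member predicts a mean and a noise variance per candidate; epistemic uncertainty is taken as the variance of the member means and aleatoric uncertainty as the average predicted variance. The weighting form `1 / (1 + relative noise)`, the `temperature` parameter, and all function names are hypothetical choices for illustration; the abstract does not specify the actual weighting.

```python
import numpy as np


def decompose_uncertainty(means, variances):
    """Split ensemble predictions into epistemic and aleatoric parts.

    means, variances: arrays of shape (n_members, n_candidates).
    """
    epistemic = means.var(axis=0)       # disagreement between ensemble members
    aleatoric = variances.mean(axis=0)  # average predicted noise variance
    return epistemic, aleatoric


def confidence_weighted_scores(epistemic, aleatoric, temperature=1.0):
    """Down-weight epistemic uncertainty where predicted noise is high.

    Hypothetical weighting: confidence lies in (0, 1] and decreases as the
    aleatoric variance grows relative to the pool median.
    """
    scale = np.median(aleatoric) + 1e-12
    confidence = 1.0 / (1.0 + (aleatoric / scale) / temperature)
    return epistemic * confidence


def select_batch(scores, batch_size):
    """Return indices of the highest-scoring candidates, best first."""
    return np.argsort(scores)[::-1][:batch_size]


# Toy pool of three candidates, two ensemble members (assumed values).
means = np.array([[0.0, 0.0, 0.0],
                  [2.0, 2.0, 0.2]])
variances = np.array([[0.1, 5.0, 8.0],
                      [0.1, 5.0, 8.0]])

epistemic, aleatoric = decompose_uncertainty(means, variances)
scores = confidence_weighted_scores(epistemic, aleatoric)
chosen = select_batch(scores, batch_size=1)
```

In this toy pool, ranking by total predictive variance (`epistemic + aleatoric`) would favor the third candidate, whose uncertainty is almost entirely noise, while the confidence-weighted score favors the first candidate, where the ensemble disagrees and the predicted noise is low.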
