Collecting labeled data for machine learning models is often expensive and time-consuming. Active learning addresses this challenge by selectively labeling the most informative observations, but when the initial labeled data is limited, it becomes difficult to distinguish genuinely informative points from those that appear uncertain primarily due to noise. Ensemble methods such as random forests are a powerful way to quantify this uncertainty, but they aggregate all models indiscriminately, including poor-performing and redundant ones, a problem that worsens in the presence of noisy data. We introduce UNique Rashomon Ensembled Active Learning (UNREAL), which selectively ensembles only distinct models from the Rashomon set, the set of nearly optimal models. Restricting ensemble membership to high-performing models with different explanations helps separate genuine uncertainty from noise-induced variation. We show that UNREAL achieves faster theoretical convergence rates than traditional active learning approaches and delivers empirical improvements of up to 20% in predictive accuracy across five benchmark datasets, while simultaneously enhancing model interpretability.
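The core idea, selecting a query point by the disagreement of distinct near-optimal models rather than of a full indiscriminate ensemble, can be illustrated with a minimal sketch. This is not the paper's algorithm: the model class (bootstrapped shallow trees), the Rashomon tolerance `eps`, and the deduplication criterion (identical prediction patterns on the unlabeled pool) are all illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy labeled and unlabeled pools (hypothetical data).
X_lab = rng.normal(size=(40, 2))
y_lab = (X_lab[:, 0] + X_lab[:, 1] > 0).astype(int)
X_unl = rng.normal(size=(200, 2))

# 1. Fit a pool of candidate models (bootstrapped shallow trees here).
models = []
for seed in range(50):
    idx = rng.integers(0, len(X_lab), len(X_lab))
    m = DecisionTreeClassifier(max_depth=3, random_state=seed)
    m.fit(X_lab[idx], y_lab[idx])
    models.append(m)

# 2. Keep only near-optimal models: accuracy within eps of the best
#    (an empirical stand-in for the Rashomon set).
accs = np.array([m.score(X_lab, y_lab) for m in models])
eps = 0.05
rashomon = [m for m, a in zip(models, accs) if a >= accs.max() - eps]

# 3. Deduplicate: keep one model per distinct prediction pattern,
#    so redundant models do not dominate the vote.
seen, unique = set(), []
for m in rashomon:
    key = tuple(m.predict(X_unl))
    if key not in seen:
        seen.add(key)
        unique.append(m)

# 4. Query the unlabeled point on which the distinct near-optimal
#    models disagree most.
preds = np.stack([m.predict(X_unl) for m in unique])
disagreement = preds.var(axis=0)
query_idx = int(disagreement.argmax())
```

Because step 2 discards low-accuracy models and step 3 collapses redundant ones before disagreement is measured, high variance at `query_idx` reflects a genuine split among good, distinct explanations rather than noise amplified by weak ensemble members.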