Collecting labeled data for machine learning models is often expensive and time-consuming. Active learning addresses this challenge by selectively labeling the most informative observations, but when initial labeled data is limited, it becomes difficult to distinguish genuinely informative points from those that appear uncertain primarily due to noise. Ensemble methods such as random forests are a powerful approach to quantifying this uncertainty, but they do so by aggregating all models indiscriminately, including poor-performing and redundant models, a problem that worsens in the presence of noisy data. We introduce UNique Rashomon Ensembled Active Learning (UNREAL), which selectively ensembles only distinct models from the Rashomon set, the set of nearly optimal models. Restricting ensemble membership to high-performing models with different explanations helps distinguish genuine uncertainty from noise-induced variation. We show that UNREAL achieves faster theoretical convergence rates than traditional active learning approaches and demonstrates empirical improvements of up to 20% in predictive accuracy across five benchmark datasets, while simultaneously enhancing model interpretability.
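To make the idea concrete, the following is a minimal sketch (not the paper's implementation) of one query step: fit a family of candidate models, keep only those within a tolerance of the best labeled-set accuracy (an approximate Rashomon set), deduplicate by prediction pattern, and query the pool point on which the surviving models disagree most. The decision-stump model class, the `eps` tolerance, and the disagreement score are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D pool: true label is 1 when x > 0.5 (hypothetical setup).
pool = rng.uniform(0, 1, 200)
true_labels = (pool > 0.5).astype(int)

# Small initial labeled set.
labeled_idx = list(rng.choice(200, size=8, replace=False))

def stump_predict(threshold, x):
    """Decision stump: predict 1 when x exceeds the threshold."""
    return (x > threshold).astype(int)

# Candidate model class: stumps over a grid of thresholds.
thresholds = np.linspace(0, 1, 101)

def rashomon_set(eps=0.05):
    """Thresholds whose labeled-set accuracy is within eps of the best."""
    X, y = pool[labeled_idx], true_labels[labeled_idx]
    accs = np.array([(stump_predict(t, X) == y).mean() for t in thresholds])
    return thresholds[accs >= accs.max() - eps]

def unique_models(ts):
    """Keep one stump per distinct prediction pattern on the pool."""
    seen, kept = set(), []
    for t in ts:
        key = stump_predict(t, pool).tobytes()
        if key not in seen:
            seen.add(key)
            kept.append(t)
    return kept

# One active-learning step: query the pool point the unique Rashomon
# models disagree on most (vote split closest to 50/50).
models = unique_models(rashomon_set())
votes = np.mean([stump_predict(t, pool) for t in models], axis=0)
disagreement = 0.5 - np.abs(votes - 0.5)  # 0 = unanimous, 0.5 = even split
query = int(np.argmax(disagreement))
labeled_idx.append(query)  # "label" the queried point
```

Deduplicating by prediction pattern is what distinguishes this from plain query-by-committee: redundant copies of the same near-optimal model no longer dominate the vote, so the disagreement signal reflects genuinely different explanations of the labeled data.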