Active learning is a practical field of machine learning that automates the process of selecting which data to label. Current methods effectively reduce the burden of data labeling but rely heavily on the model. As a result, the sampled data often cannot be transferred to new models, and the sampling is prone to bias. Both issues are crucial concerns in machine learning deployment. We propose active learning methods that use combinatorial coverage to overcome these issues. The proposed methods are data-centric rather than model-centric. Our experiments show that including coverage in active learning yields sampled data that tends to transfer best to better-performing models, with sampling bias that is competitive with benchmark methods.
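To make the idea of coverage-driven, model-free sampling concrete, the following is a minimal sketch and not the method proposed here. It assumes features have already been discretized, treats a t-way interaction as a combination of t (feature index, value) pairs, and greedily picks the unlabeled points that add the most interactions not yet covered. The function names `interactions`, `coverage`, and `greedy_coverage_sample` are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def interactions(row, t=2):
    """Return the set of t-way (feature index, value) combinations in one row."""
    return {
        tuple((i, row[i]) for i in idx)
        for idx in combinations(range(len(row)), t)
    }


def coverage(selected_rows, pool_rows, t=2):
    """Fraction of the pool's t-way interactions that appear in selected_rows."""
    universe = set()
    for r in pool_rows:
        universe |= interactions(r, t)
    if not universe:
        return 1.0  # assumption: an empty pool counts as fully covered
    covered = set()
    for r in selected_rows:
        covered |= interactions(r, t)
    return len(covered & universe) / len(universe)


def greedy_coverage_sample(pool_rows, labeled_rows, budget, t=2):
    """Greedily choose pool indices that add the most uncovered interactions.

    No model is consulted: selection depends only on the data, which is the
    sense in which this sketch is data-centric.
    """
    covered = set()
    for r in labeled_rows:
        covered |= interactions(r, t)
    remaining = {i: interactions(r, t) for i, r in enumerate(pool_rows)}
    chosen = []
    for _ in range(min(budget, len(remaining))):
        # Ties are broken in favor of the smallest index for determinism.
        best = max(remaining, key=lambda i: (len(remaining[i] - covered), -i))
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen


if __name__ == "__main__":
    pool = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (0, 0, 0), (1, 1, 1)]
    picked = greedy_coverage_sample(pool, labeled_rows=[], budget=2)
    print(picked, coverage([pool[i] for i in picked], pool))
```

Because the selection never queries a model, the same chosen indices can be reused to train any downstream model. A practical variant might combine such a coverage gain with a model-based uncertainty score, but that combination is an assumption here and is not specified by the text above.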