This is the first work to investigate the effectiveness of BERT-based contextual embeddings in active learning (AL) for automatic text classification (ATC) under cold-start scenarios, where traditional fine-tuning is infeasible because no labeled data is available. Our primary contribution is DoTCAL, a more robust fine-tuning pipeline that diminishes AL's reliance on labeled data in two steps: (1) fully leveraging unlabeled data through domain adaptation of the embeddings via masked language modeling, and (2) further adjusting the model weights using the labeled data selected by AL. Our evaluation contrasts BERT-based embeddings with other prevalent text representation paradigms, including Bag of Words (BoW), Latent Semantic Indexing (LSI), and FastText, at two critical stages of the AL process: instance selection and classification. Experiments on eight ATC benchmarks with varying AL budgets (numbers of labeled instances) and dataset sizes (roughly 5,000 to 300,000 instances) demonstrate DoTCAL's superior effectiveness, achieving up to a 33% improvement in Macro-F1 while halving the labeling effort compared to the traditional one-step approach. Surprisingly, we also find that in several tasks, BoW and LSI (owing to information aggregation) outperform BERT by up to 59%, especially in low-budget scenarios and hard-to-classify tasks.
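To make the instance-selection stage concrete, the following is a minimal sketch of a pool-based AL loop with uncertainty sampling over BoW features. The toy corpus, the logistic-regression classifier, and the seed/budget sizes are illustrative assumptions, not the paper's actual experimental setup.

```python
# Hedged sketch: pool-based active learning with uncertainty sampling
# over Bag-of-Words features. The corpus and model here are assumptions
# standing in for an ATC benchmark, not the paper's configuration.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny toy corpus: two easily separable "topics".
texts = [
    "cheap meds buy now", "limited offer buy cheap", "win money fast now",
    "meeting agenda for monday", "project status report",
    "schedule the team meeting", "buy now limited money",
    "monday project meeting",
]
labels = np.array([1, 1, 1, 0, 0, 0, 1, 0])  # 1 = spam-like, 0 = work-like

X = CountVectorizer().fit_transform(texts)   # BoW representation

labeled = [0, 3]                             # seed set: one example per class
pool = [i for i in range(len(texts)) if i not in labeled]
budget = 3                                   # AL budget: extra labels to request

for _ in range(budget):
    clf = LogisticRegression().fit(X[labeled], labels[labeled])
    probs = clf.predict_proba(X[pool])
    # Uncertainty sampling: query the pool instance whose top class
    # probability is lowest, i.e. the least confident prediction.
    pick = pool[int(np.argmin(probs.max(axis=1)))]
    labeled.append(pick)                     # "oracle" reveals its label
    pool.remove(pick)

final = LogisticRegression().fit(X[labeled], labels[labeled])
accuracy = final.score(X, labels)
```

In DoTCAL's two-step pipeline, the classifier above would instead be a BERT model whose weights were first domain-adapted on the unlabeled pool via masked language modeling, and then fine-tuned on the instances the AL loop selects.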