Pretraining neural networks on massive unlabeled datasets has become popular because it equips deep models with a better prior for solving downstream tasks. However, this approach generally assumes that sufficient annotated data is available for each downstream task. In this work, we propose ALOE, a novel system that uses active learning to improve the data- and label-efficiency of non-semantic speech tasks. ALOE combines pretrained models with active learning to label data incrementally and learn classifiers for downstream tasks, thereby mitigating the need to acquire labeled data beforehand. We demonstrate the effectiveness of ALOE across a wide range of tasks, uncertainty-based acquisition functions, and model architectures. Training a linear classifier on top of a frozen encoder with ALOE achieves performance similar to that of several baselines that use the entire labeled dataset.
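The loop described above (frozen encoder, linear classifier, uncertainty-based acquisition, incremental labeling) can be illustrated with a minimal sketch. It is not the ALOE implementation. It assumes the frozen encoder's embeddings are already computed, and simulates them with synthetic Gaussian data. It uses predictive entropy as one example of an uncertainty-based acquisition function. All function names, the seed size, batch size, number of rounds, and learning rate are hypothetical choices made for illustration.

```python
import numpy as np


def predict_proba(X, W, b):
    """Softmax probabilities of a linear classifier on fixed embeddings."""
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)


def train_linear(X, y, n_classes, steps=200, lr=0.1):
    """Fit a softmax linear classifier with plain gradient descent."""
    W = np.zeros((X.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot targets
    for _ in range(steps):
        grad = (predict_proba(X, W, b) - Y) / len(X)
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b


def entropy(P):
    """Predictive entropy per example; higher means more uncertain."""
    return -(P * np.log(P + 1e-12)).sum(axis=1)


def active_learning(emb, oracle_labels, n_classes,
                    seed_size=10, batch=10, rounds=5, rng=None):
    """Label data incrementally: train, score the unlabeled pool, query the top batch."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Start from a small random seed set (labels come from the oracle array).
    labeled = rng.choice(len(emb), size=seed_size, replace=False).tolist()
    for _ in range(rounds):
        W, b = train_linear(emb[labeled], oracle_labels[labeled], n_classes)
        pool = np.setdiff1d(np.arange(len(emb)), labeled)
        scores = entropy(predict_proba(emb[pool], W, b))
        picked = pool[np.argsort(scores)[-batch:]]  # most uncertain examples
        labeled.extend(picked.tolist())
    # Final fit on everything labeled so far.
    W, b = train_linear(emb[labeled], oracle_labels[labeled], n_classes)
    return W, b, labeled


# Synthetic stand-in for frozen-encoder embeddings: two Gaussian classes.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
emb = rng.normal(size=(300, 8)) + (2.0 * y[:, None] - 1.0)

W, b, labeled = active_learning(emb, y, n_classes=2)
acc = (predict_proba(emb, W, b).argmax(axis=1) == y).mean()
print(f"labeled {len(labeled)} of {len(emb)} examples, accuracy {acc:.3f}")
```

In this sketch the oracle is simply the array of true labels. In a real setting each queried index would go to a human annotator, and the encoder would stay fixed while only the linear layer is retrained after each round.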