Supervised neural approaches are hindered by their dependence on large, meticulously annotated datasets, a requirement that is particularly cumbersome for sequential tasks. The quality of annotations tends to deteriorate with the transition from expert-based to crowd-sourced labelling. To address these challenges, we present CAMEL (Confidence-based Acquisition Model for Efficient self-supervised active Learning), a pool-based active learning framework tailored to sequential multi-output problems. CAMEL possesses two core features: (1) it requires expert annotators to label only a fraction of a chosen sequence, and (2) it facilitates self-supervision for the remainder of the sequence. By deploying a label correction mechanism, CAMEL can also be utilised for data cleaning. We evaluate CAMEL on two sequential tasks, with a special emphasis on dialogue belief tracking, a task plagued by the constraints of limited and noisy datasets. Our experiments demonstrate that CAMEL significantly outperforms the baselines in terms of efficiency. Furthermore, the data corrections suggested by our method contribute to an overall improvement in the quality of the resulting datasets.
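To make the two core features concrete, the following is a minimal illustrative sketch (not the paper's actual algorithm) of confidence-based acquisition over a pool of sequences: the least-confident sequence is acquired, only its low-confidence steps are routed to an expert annotator, and the remaining steps are self-labelled with the model's own predictions. The function name, the mean-confidence aggregation, and the `threshold` parameter are assumptions for illustration.

```python
import numpy as np

def camel_style_acquisition(pool_confidences, pool_predictions, threshold=0.9):
    """Illustrative sketch of confidence-based acquisition for sequences.

    pool_confidences: list of 1-D arrays, per-step model confidence per sequence.
    pool_predictions: list of 1-D arrays, the model's predicted label per step.
    Returns the index of the acquired sequence, the step indices sent to an
    expert annotator, and self-supervised labels for the remaining steps.
    """
    # Acquire the sequence the model is least confident about overall
    # (mean per-step confidence is one simple aggregation choice).
    chosen = int(np.argmin([c.mean() for c in pool_confidences]))
    conf, pred = pool_confidences[chosen], pool_predictions[chosen]

    # Only low-confidence steps go to the expert; high-confidence steps
    # are self-labelled with the model's own predictions.
    expert_steps = np.flatnonzero(conf < threshold)
    self_labels = {int(t): int(pred[t]) for t in np.flatnonzero(conf >= threshold)}
    return chosen, expert_steps, self_labels


# Hypothetical pool of two three-step sequences:
pool_conf = [np.array([0.95, 0.40, 0.99]), np.array([0.97, 0.98, 0.96])]
pool_pred = [np.array([2, 0, 1]), np.array([1, 1, 0])]
idx, steps, labels = camel_style_acquisition(pool_conf, pool_pred)
```

Here only one of the three steps in the acquired sequence would require an expert label; the other two are filled in by self-supervision, which is the source of the annotation savings the abstract describes.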