Modern machine learning models require large labelled datasets to achieve good performance, but manually labelling such datasets is expensive and time-consuming. The data programming paradigm lets users label large datasets efficiently, but the labels it produces are noisy, which degrades the downstream model's performance. The active learning paradigm, on the other hand, acquires accurate labels but only for a small fraction of instances. In this paper, we propose ActiveDP, an interactive framework that bridges active learning and data programming to generate labels with both high accuracy and high coverage, combining the strengths of the two paradigms. Experiments show that ActiveDP outperforms previous weak supervision and active learning approaches and performs consistently well under different labelling budgets.