Deep learning models for natural language processing rely heavily on high-quality labeled datasets. However, existing labeling approaches often struggle to balance label quality with labeling cost. To address this challenge, we propose DALL, a text labeling framework that integrates data programming, active learning, and large language models. DALL introduces a structured specification that allows users and large language models to define labeling functions via configuration, rather than code. Active learning identifies informative instances for review, and the large language model analyzes these instances to help users correct labels and to refine or suggest labeling functions. We implement DALL as an interactive labeling system for text labeling tasks. Comparative, ablation, and usability studies demonstrate DALL's efficiency, the effectiveness of its modules, and its usability.
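To make the idea of configuration-defined labeling functions concrete, the following is a minimal sketch, not DALL's actual specification or implementation. The configuration schema (`name`, `type`, `keywords`, `label`), the example labels, and every function name are hypothetical and chosen only for illustration. Majority voting stands in for a data programming label model, and vote entropy stands in for whatever informativeness criterion the active learning module uses.

```python
import math
from collections import Counter

ABSTAIN = None  # a labeling function abstains when its rule does not fire

# Hypothetical configuration format (an assumption, not DALL's real spec):
# each entry declares a keyword rule and the label it votes for.
LF_CONFIGS = [
    {"name": "lf_refund", "type": "keyword",
     "keywords": ["refund", "money back"], "label": "complaint"},
    {"name": "lf_thanks", "type": "keyword",
     "keywords": ["thank", "great"], "label": "praise"},
    {"name": "lf_broken", "type": "keyword",
     "keywords": ["broken", "refund"], "label": "complaint"},
]


def build_lf(config):
    """Turn one configuration entry into a callable labeling function."""
    if config["type"] != "keyword":
        raise ValueError(f"unsupported labeling function type: {config['type']}")
    keywords = [k.lower() for k in config["keywords"]]
    label = config["label"]

    def lf(text):
        lowered = text.lower()
        # Simple substring match; vote for the label if any keyword occurs.
        return label if any(k in lowered for k in keywords) else ABSTAIN

    return lf


def apply_lfs(lfs, text):
    """Collect the non-abstaining votes of all labeling functions on a text."""
    votes = (lf(text) for lf in lfs)
    return [v for v in votes if v is not ABSTAIN]


def majority_label(votes):
    """Aggregate votes by majority (a stand-in for a learned label model)."""
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]


def vote_entropy(votes):
    """Disagreement among votes; uncovered texts get infinite priority."""
    if not votes:
        return float("inf")
    total = len(votes)
    return -sum((c / total) * math.log(c / total)
                for c in Counter(votes).values())


def select_for_review(lfs, texts, k):
    """Return indices of the k texts with the highest vote entropy."""
    scored = [(vote_entropy(apply_lfs(lfs, t)), i) for i, t in enumerate(texts)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]


if __name__ == "__main__":
    lfs = [build_lf(c) for c in LF_CONFIGS]
    texts = [
        "I want a refund, it arrived broken",
        "Thank you, great service",
        "Great product but I want a refund",
        "Where is my order?",
    ]
    for t in texts:
        print(t, "->", majority_label(apply_lfs(lfs, t)))
    print("review first:", select_for_review(lfs, texts, k=2))
```

In this sketch, adding or refining a rule means editing an entry in `LF_CONFIGS` rather than writing code, which is the property the abstract attributes to the structured specification. The texts selected for review are those with no coverage or with conflicting votes, the kind of instance a user or a large language model might inspect to correct a label or propose a new rule.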