The field of text generation suffers from a severe shortage of labeled data because manual annotation is extremely expensive and time-consuming. A natural approach for coping with this problem is active learning (AL), a well-known machine learning technique for improving annotation efficiency by selectively choosing the most informative examples to label. However, while AL has been well researched in the context of text classification, its application to text generation has remained largely unexplored. In this paper, we present a first systematic study of active learning for text generation, considering a diverse set of tasks and multiple leading AL strategies. Our results indicate that existing AL strategies, despite their success in classification, are largely ineffective in the text generation scenario and fail to consistently surpass the baseline of random example selection. We highlight some notable differences between the classification and generation scenarios, and analyze the selection behaviors of existing AL strategies. Our findings motivate exploring novel approaches for applying AL to natural language generation (NLG) tasks.