Interpolation-based data augmentation (DA) methods such as Mixup linearly interpolate the inputs and labels of two or more training examples. Mixup has more recently been adapted to Natural Language Processing (NLP), mainly for sequence labeling tasks. However, such direct adoption yields mixed or unstable improvements over baseline models. We argue that direct-adoption methods do not account for the structures in NLP tasks. To this end, we propose SegMix, a collection of interpolation-based DA algorithms that can adapt to task-specific structures. SegMix poses fewer constraints on data structures, is robust to various hyperparameter settings, applies to more task settings, and adds little computational overhead. At the algorithm's core, we apply interpolation to task-specific meaningful segments rather than to whole sequences as in prior work. We find SegMix to be a flexible framework that combines rule-based DA methods with interpolation-based methods, creating interesting mixtures of DA techniques. We show that SegMix consistently improves performance over strong baseline models in Named Entity Recognition (NER) and Relation Extraction (RE) tasks, especially under data-scarce settings. Furthermore, the method is easy to implement.
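To make the contrast between sequence-level and segment-level interpolation concrete, the following is a minimal sketch, not the authors' implementation. It assumes token embeddings and one-hot token labels are plain lists of floats, that the segment taken from the other example has the same length as the span it is mixed into, and that the mixing weight `lam` is supplied by the caller (Mixup commonly samples it from a Beta distribution, for example with `random.betavariate`). The function names `mixup` and `segment_mix` are hypothetical and chosen only for illustration.

```python
def mixup(x_a, y_a, x_b, y_b, lam):
    """Linearly interpolate two examples: lam * a + (1 - lam) * b.

    x_* are feature vectors and y_* are (soft) label vectors,
    all given as equal-length lists of floats.
    """
    x = [lam * p + (1 - lam) * q for p, q in zip(x_a, x_b)]
    y = [lam * p + (1 - lam) * q for p, q in zip(y_a, y_b)]
    return x, y


def segment_mix(seq_x, seq_y, span, seg_x, seg_y, lam):
    """Interpolate only the tokens inside `span` with a segment from
    another example; tokens outside the span are copied unchanged.

    Assumption for this sketch: the donor segment has exactly as many
    tokens as the span (end index is exclusive).
    """
    start, end = span
    if end - start != len(seg_x) or len(seg_x) != len(seg_y):
        raise ValueError("span and donor segment must have the same length")
    out_x = [list(v) for v in seq_x]
    out_y = [list(v) for v in seq_y]
    for offset in range(end - start):
        i = start + offset
        out_x[i], out_y[i] = mixup(
            seq_x[i], seq_y[i], seg_x[offset], seg_y[offset], lam
        )
    return out_x, out_y
```

In this sketch, an NER-style use would pass the embeddings and labels of one sentence, the token span of an entity mention in it, and the embeddings and labels of a mention drawn from another sentence. Only the mention's tokens receive mixed embeddings and soft labels, while the surrounding context keeps its original values.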