We present data augmentation techniques for process extraction from scientific publications. We cast process extraction as a sequence labeling task: we identify all the entities in a sentence and label each one according to its process-specific role. The proposed methods create meaningful augmented sentences by using (1) process-specific information from the original sentence, (2) role label similarity, and (3) sentence similarity. We demonstrate that these methods substantially improve the performance of a process extraction model trained on chemistry domain datasets, raising the F-score by up to 12.3 points. The proposed methods may also reduce overfitting, especially when training on small datasets or in low-resource settings such as chemistry and other scientific domains.
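To make the sequence labeling framing concrete, the following is a minimal sketch of one way role labels could drive augmentation: entity mentions in a BIO-tagged sentence are swapped for other mentions that carry the same role label elsewhere in the corpus. It is an illustration under stated assumptions, not the method evaluated in the paper. The role names (`OPERATION`, `MATERIAL`), the BIO tagging scheme, the example sentences, and the function names `extract_spans`, `build_pool`, and `augment` are all hypothetical, and exact label matching stands in for whatever role label similarity measure the authors use.

```python
import random
from collections import defaultdict


def extract_spans(tokens, tags):
    """Return (start, end, role) for each BIO-tagged entity span; end is exclusive."""
    spans = []
    start, role = None, None
    for i, tag in enumerate(tags):
        # Close the open span when the current tag does not continue it.
        if start is not None and tag != f"I-{role}":
            spans.append((start, i, role))
            start, role = None, None
        if tag.startswith("B-"):
            start, role = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), role))
    return spans


def build_pool(corpus):
    """Collect every entity mention in the corpus, grouped by role label."""
    pool = defaultdict(list)
    for tokens, tags in corpus:
        for start, end, role in extract_spans(tokens, tags):
            pool[role].append(tokens[start:end])
    return pool


def augment(tokens, tags, pool, rng):
    """Swap each entity for a different same-role mention; keep it if none exists."""
    new_tokens, new_tags = [], []
    cursor = 0
    for start, end, role in extract_spans(tokens, tags):
        # Copy the non-entity tokens that precede this span unchanged.
        new_tokens.extend(tokens[cursor:start])
        new_tags.extend(tags[cursor:start])
        original = tokens[start:end]
        candidates = [m for m in pool[role] if m != original]
        mention = rng.choice(candidates) if candidates else original
        new_tokens.extend(mention)
        # Re-tag the replacement so tokens and tags stay aligned.
        new_tags.extend([f"B-{role}"] + [f"I-{role}"] * (len(mention) - 1))
        cursor = end
    new_tokens.extend(tokens[cursor:])
    new_tags.extend(tags[cursor:])
    return new_tokens, new_tags


# Hypothetical two-sentence corpus with made-up role labels.
corpus = [
    (["Heat", "the", "copper", "oxide"],
     ["B-OPERATION", "O", "B-MATERIAL", "I-MATERIAL"]),
    (["Stir", "the", "ethanol", "slowly"],
     ["B-OPERATION", "O", "B-MATERIAL", "O"]),
]
pool = build_pool(corpus)
augmented_tokens, augmented_tags = augment(*corpus[0], pool, random.Random(0))
print(list(zip(augmented_tokens, augmented_tags)))
```

Because replacements are drawn only from mentions with the same role, the augmented sentence keeps the role sequence of the original, and the tags are regenerated so that multi-token replacements stay correctly aligned. A fuller version would presumably also filter candidates by sentence similarity, which this sketch omits.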