NLP models are used in a variety of critical social computing tasks, such as detecting sexist, racist, or otherwise hateful content. Therefore, it is imperative that these models be robust to spurious features. Past work has attempted to tackle such spurious features using training data augmentation, including Counterfactually Augmented Data (CADs). CADs introduce minimal changes to existing training data points and flip their labels; training on them may reduce model dependency on spurious features. However, manually generating CADs can be time-consuming and expensive. Hence, in this work, we assess whether this task can be automated using generative NLP models. We automatically generate CADs using Polyjuice, ChatGPT, and Flan-T5, and evaluate their usefulness in improving model robustness compared to manually generated CADs. By testing both model performance on multiple out-of-domain test sets and individual data point efficacy, we show that while manual CADs are still the most effective, CADs generated by ChatGPT come a close second. One key reason for the lower performance of automated methods is that the changes they introduce are often insufficient to flip the original label.