Are the longstanding robustness issues in NLP resolved by today's larger and more performant models? To address this question, we conduct a thorough investigation using 19 models of different sizes, spanning different architectural choices and pretraining objectives. We conduct evaluations using (a) OOD and challenge test sets, (b) CheckLists, (c) contrast sets, and (d) adversarial inputs. Our analysis reveals that not all OOD tests provide further insight into robustness. Evaluating with CheckLists and contrast sets shows significant gaps in model performance; merely scaling models does not make them sufficiently robust. Finally, we point out that current approaches for adversarial evaluations of models are themselves problematic: they can be easily thwarted and, in their current form, do not represent a sufficiently deep probe of model robustness. We conclude that not only is the question of robustness in NLP as yet unresolved, but even some of the approaches to measure robustness need to be reassessed.