We present FLUKE (Framework for LingUistically-driven and tasK-agnostic robustness Evaluation), a framework for assessing model robustness through systematic minimal variations of test data. FLUKE introduces controlled variations across linguistic levels -- from orthography to dialect and style -- and leverages large language models (LLMs) with human validation to generate the modifications. We demonstrate FLUKE's utility by evaluating both fine-tuned models and LLMs across six diverse NLP tasks (four classification and two generation tasks), revealing that (1) the impact of linguistic variations is highly task-dependent, with some tests being critical for certain tasks but irrelevant for others; (2) LLMs still exhibit significant brittleness to certain linguistic variations, with reasoning LLMs surprisingly showing less robustness than base models on some tasks; (3) models are overall more brittle to natural, fluent modifications such as syntax or style changes (and especially to negation) than to corruption-style tests such as letter flipping; (4) a model's ability to use a linguistic feature in generation does not correlate with its robustness to that feature on downstream tasks. These findings highlight the importance of systematic robustness testing for understanding model behavior.
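To make the evaluation protocol concrete, the following is a minimal sketch of the kind of minimal-pair test the abstract describes; it is illustrative only, and `classify`, `negate`, and `flip_rate` are hypothetical stand-ins rather than FLUKE's actual API. The idea is to apply one controlled linguistic modification to each test input and measure how often the model's prediction changes.

```python
# Minimal sketch of a minimal-variation robustness check; illustrative
# only, not FLUKE's released code. `classify` stands in for any model
# under test, and `negate` for one modification type (FLUKE generates
# such edits with an LLM and validates them with humans).
from typing import Callable

def negate(text: str) -> str:
    # Toy rule-based negation; FLUKE instead prompts an LLM to produce
    # a fluent minimal edit and has humans validate the result.
    return text.replace(" is ", " is not ", 1)

def flip_rate(classify: Callable[[str], str],
              texts: list[str],
              perturb: Callable[[str], str]) -> float:
    # Fraction of inputs whose predicted label changes under the
    # modification. Whether a flip counts as an error is task-dependent
    # (finding 1): negation should flip a sentiment label, but should
    # not change, say, a named-entity prediction.
    flips = sum(classify(perturb(t)) != classify(t) for t in texts)
    return flips / len(texts)
```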