We present FLUKE (Framework for LingUistically-driven and tasK-agnostic robustness Evaluation), a framework for assessing model robustness through systematic minimal variations of test data. FLUKE introduces controlled variations across linguistic levels, from orthography to dialect and style, and uses large language models (LLMs), with human validation, to generate the modifications. We demonstrate FLUKE's utility by evaluating both fine-tuned models and LLMs on six diverse NLP tasks (four classification and two generation tasks). Our results reveal that (1) the impact of linguistic variations is highly task-dependent, with some tests being critical for certain tasks but irrelevant for others; (2) LLMs remain significantly brittle to certain linguistic variations: reasoning LLMs are, surprisingly, less robust than base models on some tasks, and scaling improves robustness only for surface-level modifications; (3) models are overall more brittle to natural, fluent modifications such as syntax or style changes (and especially negation) than to corruption-style tests such as letter flipping; (4) a model's ability to use a linguistic feature in generation does not correlate with its robustness to that feature on downstream tasks. These findings highlight the importance of systematic robustness testing for understanding model behavior.
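To make the idea of a minimal-variation test concrete, the following is a small illustrative sketch, not FLUKE's actual implementation. It contrasts a corruption-style variation (swapping adjacent letters) with a fluent one (negation) and measures accuracy before and after. All names here (`swap_adjacent_letters`, `negate_copula`, `keyword_model`, `accuracy`) are hypothetical, the rule-based negation stands in for the LLM-generated and human-validated modifications described above, and the metric used in the paper may differ.

```python
import random
from typing import Callable, Sequence


def swap_adjacent_letters(text: str, rng: random.Random) -> str:
    """Corruption-style variation: swap two adjacent inner letters of one word."""
    words = text.split()
    # Only purely alphabetic words with at least four letters have swappable inner letters.
    candidates = [i for i, w in enumerate(words) if len(w) >= 4 and w.isalpha()]
    if not candidates:
        return text
    i = rng.choice(candidates)
    w = words[i]
    j = rng.randrange(1, len(w) - 2)  # first and last letters stay in place
    words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)


def negate_copula(text: str) -> str:
    """Fluent variation (toy rule): negate the first ' is ' in the sentence."""
    return text.replace(" is ", " is not ", 1)


def keyword_model(text: str) -> str:
    """Deliberately brittle toy classifier (assumption): keys on the word 'good'."""
    return "positive" if "good" in text.lower().split() else "negative"


def accuracy(model: Callable[[str], str],
             texts: Sequence[str],
             labels: Sequence[str]) -> float:
    """Fraction of texts for which the model predicts the expected label."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    if not texts:
        return 0.0
    return sum(model(t) == y for t, y in zip(texts, labels)) / len(texts)


if __name__ == "__main__":
    originals = ["The film is good", "The food is bad"]
    labels = ["positive", "negative"]

    # Negation changes the meaning, so the expected labels flip as well.
    negated = [negate_copula(t) for t in originals]
    negated_labels = ["negative", "positive"]

    rng = random.Random(0)
    swapped = [swap_adjacent_letters(t, rng) for t in originals]

    print("original accuracy:", accuracy(keyword_model, originals, labels))
    print("negation accuracy:", accuracy(keyword_model, negated, negated_labels))
    print("letter-swap accuracy:", accuracy(keyword_model, swapped, labels))
```

The sketch pairs each modified input with its expected label, which may itself change under a meaning-altering variation such as negation. A robustness score can then be read off as the drop in accuracy between the original and modified sets.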