As the capabilities of large language models (LLMs) have grown, these models have achieved state-of-the-art results on a wide range of natural language processing (NLP) tasks. However, their performance on commonly used benchmark datasets often fails to reflect their reliability and robustness on real-world noisy data. To address this gap, we propose a unified robustness evaluation framework based on the slot-filling task to systematically evaluate the dialogue understanding capability of LLMs under diverse input perturbation scenarios. Specifically, we construct an input perturbation evaluation dataset, Noise-LLM, which contains five types of single perturbation and four types of mixed perturbation data. Furthermore, we use a multi-level data augmentation method (character, word, and sentence levels) to construct a candidate data pool, and design two automatic task demonstration construction strategies (instance-level and entity-level) with various prompt templates. Our aim is to assess how well various robustness methods for LLMs perform in real-world noisy scenarios. The experiments show that current open-source LLMs generally achieve limited perturbation robustness. Based on these observations, we make some forward-looking suggestions to fuel research in this direction.
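To make the character-level end of the multi-level augmentation concrete, the following is a minimal sketch of one possible typo-style perturbation for a slot-filling example. It is an illustration under our own assumptions, not the procedure used to build Noise-LLM: the function names `swap_adjacent_chars` and `perturb_utterance`, the `rate` and `seed` parameters, the BIO label format, and the sample utterance are all hypothetical.

```python
import random


def swap_adjacent_chars(word, rng):
    """Swap one random pair of adjacent characters (a typo-style perturbation)."""
    if len(word) < 2:
        return word  # nothing to swap in a single-character token
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def perturb_utterance(tokens, labels, rate=0.3, seed=0):
    """Apply character-level noise to each token with probability `rate`.

    The number of tokens is unchanged, so the slot labels stay aligned
    with the (possibly noisy) tokens. All parameters are illustrative.
    """
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    noisy = [
        swap_adjacent_chars(tok, rng) if rng.random() < rate else tok
        for tok in tokens
    ]
    return noisy, list(labels)


# Hypothetical slot-filling example with BIO-style labels.
tokens = ["book", "a", "flight", "to", "boston"]
labels = ["O", "O", "O", "O", "B-city"]
noisy_tokens, noisy_labels = perturb_utterance(tokens, labels, rate=0.5, seed=1)
```

Because the perturbation operates inside tokens and never inserts or deletes them, the original slot annotations can be reused without re-alignment. Word-level and sentence-level perturbations would generally need additional care to keep labels consistent.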