The rapid growth in the scale and capabilities of Large Language Models (LLMs) makes them promising tools for a variety of downstream tasks. Beyond the pursuit of better performance and the avoidance of violent responses to particular prompts, the robustness of LLMs has drawn much attention as a requirement for their responsible use. However, existing evaluation methods mostly rely on traditional question-answering datasets with predefined supervised labels, which do not match the strong generative capabilities of contemporary LLMs. To address this issue, we propose a novel rational evaluation approach that uses pre-trained reward models as diagnostic tools to evaluate the robustness of LLMs, which we refer to as the Reward Model for Reasonable Robustness Evaluation (TREvaL). Our extensive experiments demonstrate that TREvaL provides an accurate method for evaluating the robustness of an LLM, especially on more challenging open questions. Furthermore, our results show that LLMs are frequently vulnerable to word-level perturbations, which are commonplace in daily language usage. Notably, and to our surprise, robustness tends to decrease as fine-tuning (SFT and RLHF) is conducted. The code of TREvaL is available at https://github.com/Harry-mic/TREval.
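The evaluation idea described in the abstract (perturb the prompt at the word level, let the LLM answer, and let a reward model score the answer) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the TREvaL repository: `swap_adjacent_chars` is one hypothetical perturbation, `echo_llm` and `overlap_reward` are toy stand-ins for a real LLM and a pre-trained reward model, and scoring each answer against the clean prompt is likewise an assumed design choice.

```python
import random
from typing import Callable, List


def swap_adjacent_chars(prompt: str, rate: float, rng: random.Random) -> str:
    """Hypothetical word-level perturbation: with probability `rate`, swap two
    adjacent inner characters of a word (a typo-like misspelling)."""
    perturbed = []
    for word in prompt.split():
        if len(word) > 3 and rng.random() < rate:
            # Pick an inner position so the first and last characters stay fixed.
            i = rng.randrange(1, len(word) - 2)
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        perturbed.append(word)
    return " ".join(perturbed)


def echo_llm(prompt: str) -> str:
    """Toy stand-in for an LLM (assumption): it simply repeats its input."""
    return prompt


def overlap_reward(clean_prompt: str, response: str) -> float:
    """Toy stand-in for a reward model (assumption): the fraction of
    clean-prompt words that also appear in the response."""
    prompt_words = clean_prompt.split()
    response_words = set(response.split())
    return sum(w in response_words for w in prompt_words) / len(prompt_words)


def reward_drop(
    prompts: List[str],
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
    perturb: Callable[[str], str],
) -> float:
    """Mean reward on clean prompts minus mean reward on perturbed prompts.

    Answers are always scored against the clean prompt (assumption), so a
    larger positive value means the model is less robust to the perturbation.
    """
    clean = [reward(p, generate(p)) for p in prompts]
    noisy = [reward(p, generate(perturb(p))) for p in prompts]
    return sum(clean) / len(clean) - sum(noisy) / len(noisy)


if __name__ == "__main__":
    rng = random.Random(0)
    questions = ["Explain how vaccines train the immune system"]
    drop = reward_drop(
        questions,
        echo_llm,
        overlap_reward,
        lambda p: swap_adjacent_chars(p, rate=0.5, rng=rng),
    )
    print(f"reward drop under perturbation: {drop:.3f}")
```

In a real setting, `generate` would wrap the LLM under test and `reward` would wrap a pre-trained reward model; the toy functions only make the sketch self-contained and deterministic enough to run.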