The rapid growth in the scale and capabilities of Large Language Models (LLMs) positions them as promising tools for a variety of downstream tasks. Beyond the pursuit of better performance and the avoidance of violent responses to a given prompt, the robustness of LLMs has drawn considerable attention as part of ensuring that they behave responsibly. However, existing evaluation methods mostly rely on traditional question-answering datasets with predefined supervised labels, which do not match the strong generative capabilities of contemporary LLMs. To address this issue, we propose a novel rational evaluation approach, which we refer to as the Reward Model for Reasonable Robustness Evaluation (TREvaL), that leverages pre-trained reward models as diagnostic tools to evaluate the longer conversations that LLMs generate in response to more challenging open questions. Longer conversations reveal how comprehensively a language model understands a question, a capability that cannot be fully captured by single-word or single-letter answers, which may be oversimplified and inherently biased. Our extensive experiments demonstrate that TREvaL provides an innovative method for evaluating the robustness of an LLM. Furthermore, our results show that LLMs are frequently vulnerable to word-level perturbations that are commonplace in daily language usage. Notably, we find, to our surprise, that robustness tends to decrease as fine-tuning (SFT and RLHF) is conducted. The code of TREvaL is available at https://github.com/Harry-mic/TREvaL.
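To make the evaluation idea concrete, the following is a minimal sketch, not the TREvaL implementation. It assumes that robustness is summarized as the average drop in reward-model score when a word-level perturbation is applied to the prompt, and that both responses are scored against the original, unperturbed question. All names here (`swap_adjacent_words`, `reward_drop`, `toy_generate`, `toy_reward`) are hypothetical; in practice `generate_fn` would wrap the LLM under test and `reward_fn` a pre-trained reward model.

```python
import random
from typing import Callable, List


def swap_adjacent_words(prompt: str, rate: float, rng: random.Random) -> str:
    """Word-level perturbation: swap neighbouring word pairs with probability `rate`."""
    words = prompt.split()
    i = 0
    while i < len(words) - 1:
        if rng.random() < rate:
            words[i], words[i + 1] = words[i + 1], words[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return " ".join(words)


def reward_drop(
    prompts: List[str],
    generate_fn: Callable[[str], str],
    reward_fn: Callable[[str, str], float],
    perturb_fn: Callable[[str], str],
) -> float:
    """Average reward lost when the model answers a perturbed prompt.

    Assumption: both answers are scored against the ORIGINAL question,
    so the score reflects how well the model still addresses the intent.
    """
    if not prompts:
        raise ValueError("prompts must not be empty")
    drops = []
    for question in prompts:
        clean_answer = generate_fn(question)
        perturbed_answer = generate_fn(perturb_fn(question))
        drops.append(
            reward_fn(question, clean_answer) - reward_fn(question, perturbed_answer)
        )
    return sum(drops) / len(drops)


# Toy stand-ins so the sketch runs without any model (purely illustrative).
def toy_generate(prompt: str) -> str:
    """Pretend LLM: simply echoes the prompt it receives."""
    return prompt


def toy_reward(question: str, answer: str) -> float:
    """Pretend reward model: fraction of word positions where the answer matches the question."""
    q_words, a_words = question.split(), answer.split()
    matches = sum(q == a for q, a in zip(q_words, a_words))
    return matches / max(len(q_words), 1)
```

A larger value returned by `reward_drop` indicates a less robust model under the chosen perturbation. Comparing this value across a base model and its SFT and RLHF variants would mirror the kind of comparison the abstract describes, under the assumptions stated above.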