Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment

Ensuring alignment, which refers to making models behave in accordance with human intentions [1,2], has become a critical task before deploying large language models (LLMs) in real-world applications. For instance, OpenAI devoted six months to iteratively aligning GPT-4 before its release [3]. However, a major challenge faced by practitioners is the lack of clear guidance on evaluating whether LLM outputs align with social norms, values, and regulations. This obstacle hinders systematic iteration and deployment of LLMs. To address this issue, this paper presents a comprehensive survey of the key dimensions to consider when assessing LLM trustworthiness. The survey covers seven major categories of LLM trustworthiness: reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness. Each major category is further divided into several sub-categories, for a total of 29 sub-categories. A subset of 8 sub-categories is then selected for further investigation, with corresponding measurement studies designed and conducted on several widely used LLMs. The measurement results indicate that, in general, more aligned models tend to perform better in terms of overall trustworthiness. However, the effectiveness of alignment varies across the trustworthiness categories considered. This highlights the importance of more fine-grained analysis, testing, and continuous improvement of LLM alignment. By shedding light on these key dimensions of LLM trustworthiness, this paper aims to provide valuable insights and guidance to practitioners in the field. Understanding and addressing these concerns will be crucial to achieving reliable and ethically sound deployment of LLMs in various applications.

