Large language models (LLMs), exemplified by ChatGPT, have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many challenges, particularly in the realm of trustworthiness. Ensuring the trustworthiness of LLMs therefore emerges as an important topic. This paper introduces TrustLLM, a comprehensive study of trustworthiness in LLMs, including principles for different dimensions of trustworthiness, an established benchmark, an evaluation and analysis of trustworthiness for mainstream LLMs, and a discussion of open challenges and future directions. Specifically, we first propose a set of principles for trustworthy LLMs that span eight dimensions. Based on these principles, we establish a benchmark across six dimensions: truthfulness, safety, fairness, robustness, privacy, and machine ethics. We then present a study evaluating 16 mainstream LLMs in TrustLLM, drawing on over 30 datasets. First, our findings show that trustworthiness and utility (i.e., functional effectiveness) are generally positively related. Second, we observe that proprietary LLMs generally outperform most open-source counterparts in trustworthiness, raising concerns about the potential risks of widely accessible open-source LLMs; however, a few open-source LLMs come very close to proprietary ones. Third, some LLMs may be overly calibrated toward exhibiting trustworthiness, to the extent that they compromise their utility by mistakenly treating benign prompts as harmful and consequently refusing to respond. Finally, we emphasize the importance of transparency not only in the models themselves but also in the technologies that underpin trustworthiness; knowing which specific trustworthiness techniques have been employed is crucial for analyzing their effectiveness.