Recent research has made significant strides in applying alignment techniques to make large language models (LLMs) more helpful and harmless, in accordance with human intentions. In this paper, we argue for the importance of alignment for honesty: ensuring that LLMs proactively refuse to answer questions when they lack the relevant knowledge, without becoming overly conservative. A pivotal aspect of alignment for honesty is discerning the limits of an LLM's knowledge, which is far from straightforward. This challenge demands comprehensive solutions in terms of metric development, benchmark creation, and training methodologies. We address these challenges by first establishing a precise problem definition and defining ``honesty'', inspired by the Analects of Confucius. This definition serves as a cornerstone for developing metrics that measure an LLM's honesty by quantifying its progress after alignment. Furthermore, we introduce a flexible training framework, instantiated by several efficient fine-tuning techniques that emphasize honesty without sacrificing performance on other tasks. Our extensive experiments show that the aligned models exhibit a marked increase in honesty, as indicated by our proposed metrics. We open-source a wealth of resources to facilitate future research at https://github.com/GAIR-NLP/alignment-for-honesty, including honesty-aligned models, training and evaluation datasets for honesty alignment, a concept glossary, and all relevant source code.