Recent years have witnessed remarkable progress in large language models (LLMs). These advances have attracted significant attention, but they have also raised various concerns. The potential of these models is undeniably vast; however, they may generate text that is imprecise, misleading, or even harmful. Consequently, it is paramount to employ alignment techniques to ensure that these models exhibit behaviors consistent with human values. This survey provides an extensive exploration of alignment methodologies designed for LLMs, in conjunction with the extant capability research in this domain. Adopting the lens of AI alignment, we categorize the prevailing methods and emergent proposals for the alignment of LLMs into outer and inner alignment. We also probe salient issues, including the models' interpretability and their potential vulnerabilities to adversarial attacks. To assess LLM alignment, we present a wide variety of benchmarks and evaluation methodologies. After discussing the state of alignment research for LLMs, we look to the future and consider the promising avenues of research that lie ahead. Our aspiration for this survey extends beyond merely spurring research interest in this realm: we also envision bridging the gap between the AI alignment research community and researchers engaged in exploring the capabilities of LLMs, toward LLMs that are both capable and safe.