As the use of large language models (LLMs) increases within society, so does the risk of their misuse. Appropriate safeguards must be in place to ensure that LLM outputs uphold society's ethical standards and to highlight the positive role that artificial intelligence technologies can play. Recent events have raised ethical concerns about conventionally trained LLMs, leading to unsafe user experiences overall. This motivates our research question: how do we ensure LLM alignment? In this work, we introduce a test suite of unique prompts to foster the development of aligned LLMs that are fair, safe, and robust. We show that prompting LLMs at every step of the development pipeline, including data curation, pre-training, and fine-tuning, results in a more responsible model overall. Our test suite evaluates outputs from four state-of-the-art language models: GPT-3.5, GPT-4, OPT, and LLaMA-2. The assessment presented in this paper highlights a gap between societal alignment and the capabilities of current LLMs. Additionally, implementing a test suite such as ours lowers the environmental overhead of making models safe and fair.