We propose a self-correction mechanism for Large Language Models (LLMs) to mitigate issues such as toxicity and factual hallucination. The method refines model outputs using feedback from an ensemble of critics together with the model's own feedback. Humans often refine their understanding of complex topics through self-reflection and by seeking input from others; drawing on this behavior, we explore whether LLMs can emulate the same self-correction process. Our approach is model-agnostic and can be applied across domains to enhance trustworthiness by addressing fairness, bias, and robustness concerns. We consistently observe that it improves LLM performance in reducing toxicity and correcting factual errors.
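To make the refinement loop concrete, the following is a minimal sketch of how critic feedback might drive iterative revision. It is an illustration under stated assumptions, not the implementation evaluated in this work: the names `self_correct`, `toxicity_critic`, `fact_critic`, and `toy_refine`, the `max_rounds` budget, and the stopping rule are all hypothetical, and the toy critics and refiner are simple string rules standing in for real models. The model's own feedback could be treated as one more critic in the list.

```python
from typing import Callable, List, Optional

# Hypothetical types: a critic returns a feedback string, or None if it has no objection.
Critic = Callable[[str], Optional[str]]
Refiner = Callable[[str, List[str]], str]


def self_correct(draft: str, critics: List[Critic], refine: Refiner,
                 max_rounds: int = 3) -> str:
    """Revise `draft` until no critic objects or the round budget is spent."""
    for _ in range(max_rounds):
        # Collect feedback from every critic in the ensemble.
        feedback = [msg for critic in critics if (msg := critic(draft)) is not None]
        if not feedback:
            break
        draft = refine(draft, feedback)
    return draft


# Toy stand-ins (assumptions) so the sketch runs without any model.
def toxicity_critic(text: str) -> Optional[str]:
    return "remove the insult 'idiot'" if "idiot" in text else None


def fact_critic(text: str) -> Optional[str]:
    if "Lyon is the capital" in text:
        return "Paris, not Lyon, is the capital of France"
    return None


def toy_refine(text: str, feedback: List[str]) -> str:
    # A real system would prompt the LLM with the draft plus the feedback.
    for msg in feedback:
        if "idiot" in msg:
            text = text.replace("you idiot, ", "")
        if "Paris" in msg:
            text = text.replace("Lyon is the capital", "Paris is the capital")
    return text


result = self_correct("you idiot, Lyon is the capital of France.",
                      [toxicity_critic, fact_critic], toy_refine)
print(result)
```

In this sketch the loop stops as soon as the ensemble raises no objection, so an output that is already acceptable is returned unchanged.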