We propose a self-correction mechanism for Large Language Models (LLMs) that mitigates issues such as toxicity and factual hallucination. The method refines model outputs using feedback from an ensemble of critics together with the model's own feedback. Drawing inspiration from human behavior, we explore whether LLMs can emulate the self-correction process of humans, who often engage in self-reflection and seek input from others to refine their understanding of complex topics. Our approach is model-agnostic and can be applied across various domains to enhance trustworthiness by addressing fairness, bias, and robustness concerns. In our experiments, the approach consistently improves LLM performance on reducing toxicity and correcting factual errors.
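The refinement loop described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the names `Feedback`, `self_correct`, `toxicity_critic`, `fact_critic`, and `toy_reviser` are hypothetical, the critics are simple string matchers standing in for learned critic models, and the reviser is a string replacement standing in for the LLM revising its own draft. The model's own feedback could be passed as one more critic in the same list.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Feedback:
    source: str      # which critic produced this feedback
    span: str        # the problematic text found in the draft
    suggestion: str  # what the critic proposes instead


# A critic inspects a draft and returns feedback, or None if it has no objection.
Critic = Callable[[str], Optional[Feedback]]
# A reviser takes the draft plus all collected feedback and returns a new draft.
Reviser = Callable[[str, List[Feedback]], str]


def self_correct(draft: str, critics: List[Critic], revise: Reviser,
                 max_rounds: int = 3) -> str:
    """Refine `draft` until no critic objects or `max_rounds` is reached."""
    for _ in range(max_rounds):
        feedback = [fb for fb in (critic(draft) for critic in critics)
                    if fb is not None]
        if not feedback:
            break  # every critic is satisfied
        draft = revise(draft, feedback)
    return draft


# Toy stand-ins (assumptions for illustration only).
def toxicity_critic(text: str) -> Optional[Feedback]:
    bad = "stupid question"
    if bad in text:
        return Feedback("toxicity", bad, "fair question")
    return None


def fact_critic(text: str) -> Optional[Feedback]:
    wrong = "Sydney is the capital of Australia"
    if wrong in text:
        return Feedback("fact", wrong, "Canberra is the capital of Australia")
    return None


def toy_reviser(text: str, feedback: List[Feedback]) -> str:
    # A real system would prompt the LLM with the draft and the feedback;
    # here each flagged span is simply replaced by the suggestion.
    for fb in feedback:
        text = text.replace(fb.span, fb.suggestion)
    return text


corrected = self_correct(
    "That is a stupid question. Sydney is the capital of Australia.",
    [toxicity_critic, fact_critic],
    toy_reviser,
)
print(corrected)
```

Because the loop only depends on the `Critic` and `Reviser` call signatures, any generator or critic can be swapped in, which is one way to read the model-agnostic claim; the `max_rounds` cap is an added assumption to guarantee termination.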