Recent advances in large language models (LLMs) have led to powerful AI chatbots capable of natural, human-like conversation. However, these chatbots can also be harmful, exhibiting manipulative, gaslighting, and narcissistic behaviors. We define Healthy AI as AI that is safe, trustworthy, and ethical. To create healthy AI systems, we present the SafeguardGPT framework, which uses psychotherapy to correct these harmful behaviors in AI chatbots. The framework involves four types of AI agents: a Chatbot, a "User," a "Therapist," and a "Critic." We demonstrate the effectiveness of SafeguardGPT through a working example that simulates a social conversation. Our results show that the framework can improve the quality of conversations between AI chatbots and humans. Although several challenges and open directions remain, SafeguardGPT provides a promising approach to improving the alignment between AI chatbots and human values. By incorporating psychotherapy and reinforcement learning techniques, the framework enables AI chatbots to learn and adapt to human preferences and values in a safe and ethical way, contributing to the development of more human-centric and responsible AI.