Large Language Models (LLMs) often exhibit highly agreeable and reinforcing conversational styles, a behavior known as AI sycophancy. Although such agreeableness is generally encouraged, it can become problematic when responding to user prompts that reflect negative social tendencies: agreeable responses risk amplifying harmful behavior rather than mitigating it. In this study, we examine how LLMs respond to user prompts expressing varying degrees of Dark Triad traits (Machiavellianism, Narcissism, and Psychopathy) using a curated dataset. Our analysis reveals differences across models: all models predominantly exhibit corrective behavior, but produce reinforcing output in certain cases. Model behavior also varies with the severity of the expressed trait and differs in the sentiment of the response. Our findings have implications for designing safer conversational systems that can detect and respond appropriately when users escalate from benign to harmful requests.