In this work, we designed unbiased prompts to systematically evaluate the psychological safety of large language models (LLMs). First, we tested five LLMs using two personality tests: the Short Dark Triad (SD-3) and the Big Five Inventory (BFI). All models scored higher than the human average on SD-3, suggesting a relatively darker personality pattern. Despite being instruction fine-tuned with safety metrics to reduce toxicity, InstructGPT, GPT-3.5, and GPT-4 still showed dark personality patterns: these models scored higher than the self-supervised GPT-3 on the Machiavellianism and narcissism traits of SD-3. Next, we evaluated the LLMs in the GPT series using well-being tests to study the impact of fine-tuning with more training data, and observed a continuous increase in well-being scores across the GPT models. Following these observations, we showed that fine-tuning Llama-2-chat-7B on BFI responses using direct preference optimization could effectively reduce the psychological toxicity of the model. Based on these findings, we recommend applying systematic and comprehensive psychological metrics to further evaluate and improve the safety of LLMs.