In this work, we investigated whether large language models (LLMs) are psychologically safe. We designed unbiased prompts to systematically evaluate LLMs from a psychological perspective. First, we tested three different LLMs using two personality tests: the Short Dark Triad (SD-3) and the Big Five Inventory (BFI). All models scored higher than the human average on SD-3, suggesting a relatively darker personality pattern. Despite being instruction fine-tuned with safety metrics to reduce toxicity, InstructGPT and FLAN-T5 still showed implicit dark personality patterns; both models scored higher than the self-supervised GPT-3 on the Machiavellianism and narcissism traits of SD-3. We then evaluated the LLMs in the GPT-3 series using well-being tests to study the impact of fine-tuning with more training data, and observed a continuous increase in well-being scores across GPT-3 and InstructGPT. Following these observations, we showed that instruction fine-tuning FLAN-T5 on positive answers from BFI could effectively improve the model from a psychological perspective. Based on these findings, we recommend applying more systematic and comprehensive psychological metrics to further evaluate and improve the safety of LLMs.