As Large Language Models (LLMs) continue to gain popularity due to their human-like traits and the intimacy they offer to users, their societal impact inevitably expands. This creates a growing need for comprehensive studies that examine LLMs and reveal their potential opportunities, drawbacks, and overall societal impact. With that in mind, this research conducted an extensive investigation into seven LLMs, aiming to assess the temporal stability and inter-rater agreement of their responses to personality instruments at two time points. In addition, the LLMs' personality profiles were analyzed and compared to human normative data. The findings revealed varying levels of inter-rater agreement in the LLMs' responses over a short time, with some LLMs showing higher agreement (e.g., Llama3 and GPT-4o) than others (e.g., GPT-4 and Gemini). Furthermore, agreement depended on the instrument used as well as on the domain or trait. This implies that LLMs vary in how robustly they can simulate stable personality characteristics. On scales that showed at least fair agreement, LLMs displayed a mostly socially desirable profile in both agentic and communal domains, as well as a prosocial personality profile reflected in higher agreeableness and conscientiousness and lower Machiavellianism. Exhibiting temporal stability and coherent responses on personality traits is crucial for AI systems because of their societal impact and AI safety concerns.