The advancing fluency of LLMs raises important questions about their ability to emulate complex human traits, including emotional expression and personality, across diverse linguistic and cultural contexts. This study investigates whether LLMs can convincingly mimic emotional nuance in English and personality markers in Arabic, a critical under-resourced language with unique linguistic and cultural characteristics. We conduct two tasks across six models: Jais, Mistral, LLaMA, GPT-4o, Gemini, and DeepSeek. First, we evaluate whether machine classifiers can reliably distinguish between human-authored and AI-generated texts. Second, we assess the extent to which LLM-generated texts exhibit emotional or personality traits comparable to those of humans. Our results demonstrate that AI-generated texts are distinguishable from human-authored ones (F1 > 0.95), though classification performance deteriorates on paraphrased samples, indicating that the classifiers rely on superficial stylistic cues. Emotion and personality classification experiments reveal significant generalization gaps: classifiers trained on human data perform poorly on AI-generated texts and vice versa, suggesting that LLMs encode affective signals differently from humans. Importantly, augmenting training with AI-generated data enhances performance in the Arabic personality classification task, highlighting the potential of synthetic data to address challenges in under-resourced languages. Model-specific analyses show that GPT-4o and Gemini exhibit superior affective coherence. Linguistic and psycholinguistic analyses reveal measurable divergences in tone, authenticity, and textual complexity between human and AI texts. These findings have implications for affective computing, authorship attribution, and responsible AI deployment, particularly within under-resourced language contexts where generative AI detection and alignment pose unique challenges.