Stable Behavior, Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception

Large Language Models (LLMs) are increasingly used as proxies for human perception in urban analysis, yet it remains unclear whether persona prompting produces meaningful and reproducible behavioral diversity. We investigate whether distinct personas influence urban sentiment judgments generated by multimodal LLMs. Using a factorial set of personas spanning gender, economic status, political orientation, and personality, we instantiate multiple agents per persona to evaluate urban scene images from the PerceptSent dataset and assess both within-persona consistency and cross-persona variation. Results show strong convergence among agents sharing a persona, indicating stable and reproducible behavior. However, cross-persona differentiation is limited: economic status and personality induce statistically detectable but practically modest variation, while gender shows no measurable effect and political orientation only negligible impact. Agents also exhibit an extremity bias, collapsing intermediate sentiment categories common in human annotations. As a result, performance remains strong on coarse-grained polarity tasks but degrades as sentiment resolution increases, suggesting that simple label-based persona prompting does not capture fine-grained perceptual judgments. To isolate the contribution of persona conditioning, we additionally evaluate the same model without personas. Surprisingly, the no-persona model sometimes matches or exceeds persona-conditioned agreement with human labels across all task variants, suggesting that simple label-based persona prompting may add limited annotation value in this setting.
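
The factorial persona design and the within-persona consistency check described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The factor levels, the function names, and the pairwise-agreement measure are assumptions made for illustration: the abstract names the four factors but does not list their levels or the consistency metric used.

```python
from itertools import combinations, product

# Hypothetical factor levels: the abstract names the factors, not their levels.
FACTORS = {
    "gender": ["female", "male"],
    "economic_status": ["low", "high"],
    "political_orientation": ["liberal", "conservative"],
    "personality": ["optimistic", "pessimistic"],
}


def build_personas(factors):
    """Return one persona dict per combination of factor levels (full factorial)."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


def pairwise_agreement(labels):
    """Fraction of agent pairs that assigned the same label to one image."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        raise ValueError("need at least two labels per image")
    return sum(a == b for a, b in pairs) / len(pairs)


def within_persona_consistency(labels_by_image):
    """Mean pairwise agreement over images, for agents sharing one persona."""
    scores = [pairwise_agreement(labels) for labels in labels_by_image.values()]
    return sum(scores) / len(scores)


personas = build_personas(FACTORS)

# Toy labels from three agents sharing one persona (made up, not study data).
toy_labels = {
    "img_1": ["positive", "positive", "positive"],
    "img_2": ["negative", "negative", "neutral"],
}
consistency = within_persona_consistency(toy_labels)
```

With two assumed levels per factor, the full factorial yields 16 personas. Pairwise agreement is only one simple stand-in for within-persona consistency; a chance-corrected statistic could be substituted without changing the overall structure.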
