Validation of LLM-agent social simulations remains underdeveloped, with most studies relying on subjective assessments or single runs. We address this gap by running 30 independent 30-day simulations of a technology forum modeled on Voat's v/technology, using stateless Dolphin Mistral 24B agents on the Y Social platform, and evaluating operational validity across five dimensions: activity patterns, network structure, toxicity, topical coverage, and stylistic convergence. Against 30 matched, non-overlapping 30-day Voat comparison windows, results show overlapping 99% confidence intervals for unique users, root posts, and daily active users, while comments, average thread length, and mean toxicity remain higher in simulation. Both simulated and empirical networks exhibit core-periphery structure, though simulated cores are larger and more diffuse and repeated interactions are less frequent. Topic alignment is near-complete, but toxicity is misallocated across content layers: simulated root posts are substantially more toxic than real submissions, while simulated comments are less toxic than Voat comments. These findings demonstrate that LLM agents in platform-faithful environments can reproduce familiar online regularities, while systematic divergences, particularly those linked to stateless agent design and content-layer calibration, point to concrete directions for future improvement.
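The interval comparison reported for unique users, root posts, and daily active users can be illustrated with a short sketch. This is not the authors' procedure: the abstract does not say how the 99% confidence intervals were constructed, so the sketch assumes a normal-approximation interval for the mean of a per-run metric. The function names `mean_ci` and `intervals_overlap` are hypothetical, and a t-based interval would be slightly wider for 30 runs.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_ci(samples, level=0.99):
    """Return (low, high): a normal-approximation CI for the mean.

    Assumption: the study's actual interval construction is not stated
    in the abstract; this uses z * s / sqrt(n) purely for illustration.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    # Two-sided critical value, e.g. about 2.576 for level=0.99.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * stdev(samples) / sqrt(n)
    centre = mean(samples)
    return centre - half_width, centre + half_width


def intervals_overlap(a, b):
    """True when the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    # Toy placeholder values, NOT data from the study: one metric value
    # per simulation run and per empirical comparison window.
    simulated_runs = [10, 12, 11, 13]
    empirical_windows = [11, 12, 14, 13]
    sim_ci = mean_ci(simulated_runs)
    emp_ci = mean_ci(empirical_windows)
    print("simulated:", sim_ci)
    print("empirical:", emp_ci)
    print("overlap:", intervals_overlap(sim_ci, emp_ci))
```

In the setting described above, each list would hold 30 values, one per simulation run or per Voat window. Overlapping intervals indicate that the two means are not clearly separated at the chosen level, which is a weaker statement than a formal equivalence test.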