HumanLLM: Towards Personalized Understanding and Simulation of Human Nature

Motivated by the remarkable progress of large language models (LLMs) in objective tasks like mathematics and coding, there is growing interest in their potential to simulate human behavior, a capability with profound implications for transforming social science research and customer-centric business insights. However, LLMs often lack a nuanced understanding of human cognition and behavior, limiting their effectiveness in social simulation and personalized applications. We posit that this limitation stems from a fundamental misalignment: standard LLM pretraining on vast, uncontextualized web data does not capture the continuous, situated context of an individual's decisions, thoughts, and behaviors over time. To bridge this gap, we introduce HumanLLM, a foundation model designed for personalized understanding and simulation of individuals. We first construct the Cognitive Genome Dataset, a large-scale corpus curated from real-world user data on platforms like Reddit, Twitter, Blogger, and Amazon. Through a rigorous, multi-stage pipeline involving data filtering, synthesis, and quality control, we automatically extract over 5.5 million user logs to distill rich profiles, behaviors, and thinking patterns. We then formulate diverse learning tasks and perform supervised fine-tuning to empower the model to predict a wide range of individualized human behaviors, thoughts, and experiences. Comprehensive evaluations demonstrate that HumanLLM achieves superior performance in predicting user actions and inner thoughts, more accurately mimics user writing styles and preferences, and generates more authentic user profiles compared to base models. Furthermore, HumanLLM shows significant gains on out-of-domain social intelligence benchmarks, indicating enhanced generalization.
