Student Personas (SPs) are emerging as infrastructure for educational LLMs, yet prior work often relies on ad-hoc prompting or hand-crafted profiles with limited control over educational theory and population distributions. We formalize this as Theory-Aligned and Distribution-Controllable Persona Generation (TAD-PG) and introduce HACHIMI, a multi-agent Propose-Validate-Revise framework that generates theory-aligned, quota-controlled personas. HACHIMI factorizes each persona into a theory-anchored educational schema, enforces developmental and psychological constraints via a neuro-symbolic validator, and combines stratified sampling with semantic deduplication to reduce mode collapse. The resulting HACHIMI-1M corpus comprises 1 million personas for Grades 1-12. Intrinsic evaluation shows near-perfect schema validity, accurate quotas, and substantial diversity, while external evaluation instantiates personas as student agents answering CEPS and PISA 2022 surveys; across 16 cohorts, math and curiosity/growth constructs align strongly between humans and agents, whereas classroom-climate and well-being constructs are only moderately aligned, revealing a fidelity gradient. All personas are generated with Qwen2.5-72B, and HACHIMI provides a standardized synthetic student population for group-level benchmarking and social-science simulations. Resources are available at https://github.com/ZeroLoss-Lab/HACHIMI
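The Propose-Validate-Revise loop with quota control and deduplication described in the abstract can be sketched in miniature. This is a toy illustration under stated assumptions, not HACHIMI's implementation: the functions `propose`, `validate`, `revise`, and `is_duplicate`, the age-versus-grade rule, and the `INTERESTS` list are all hypothetical. A random generator stands in for the LLM agents, a single symbolic rule stands in for the neuro-symbolic validator, and Jaccard overlap on interest sets stands in for semantic deduplication.

```python
import random

# Hypothetical trait vocabulary; not taken from the HACHIMI schema.
INTERESTS = ["math", "reading", "music", "sports", "art", "coding", "science", "drama"]

def allocate_quotas(proportions, total):
    """Largest-remainder allocation: per-stratum counts sum exactly to total."""
    raw = {k: p * total for k, p in proportions.items()}
    counts = {k: int(v) for k, v in raw.items()}
    leftover = total - sum(counts.values())
    by_remainder = sorted(raw, key=lambda k: raw[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts

def propose(grade, rng):
    """Toy proposer standing in for an LLM agent; it may emit an implausible age."""
    return {
        "grade": grade,
        "age": grade + rng.choice([4, 5, 6, 7, 9]),
        "interests": rng.sample(INTERESTS, 3),
    }

def validate(persona):
    """Assumed developmental rule: age must lie within grade+5 .. grade+7."""
    errors = []
    if not 5 <= persona["age"] - persona["grade"] <= 7:
        errors.append("age")
    return errors

def revise(persona, errors):
    """Repair only the fields the validator flagged."""
    fixed = dict(persona)
    if "age" in errors:
        fixed["age"] = persona["grade"] + 6
    return fixed

def is_duplicate(persona, kept, threshold=0.8):
    """Jaccard overlap on interests, a crude stand-in for embedding similarity."""
    a = set(persona["interests"])
    for other in kept:
        b = set(other["interests"])
        if len(a & b) / len(a | b) >= threshold:
            return True
    return False

def generate(proportions, total, seed=0, max_tries=1000):
    """Fill each grade stratum up to its quota with valid, non-duplicate personas."""
    rng = random.Random(seed)
    personas = []
    for grade, quota in allocate_quotas(proportions, total).items():
        kept, tries = [], 0
        while len(kept) < quota and tries < max_tries:
            tries += 1
            candidate = propose(grade, rng)
            errors = validate(candidate)
            if errors:
                candidate = revise(candidate, errors)
            # Reject candidates that are still invalid or too close to a kept one.
            if validate(candidate) or is_duplicate(candidate, kept):
                continue
            kept.append(candidate)
        personas.extend(kept)
    return personas

if __name__ == "__main__":
    sample = generate({1: 0.5, 6: 0.25, 12: 0.25}, total=20)
    print(len(sample), "personas generated")
```

Sampling within each stratum until its quota is met keeps the grade distribution exact by construction, while the deduplication check discourages the generator from collapsing onto a few near-identical profiles inside a stratum.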