Students benefit from math problems contextualized to their interests, and large language models (LLMs) offer a promising route to efficient personalization at scale. However, LLM-personalized problems often suffer from unrealistic quantities and contexts, poor readability, limited authenticity with respect to students' experiences, and occasional mathematical inconsistencies. To address these issues, we propose a multi-agent framework that formalizes personalization as an iterative generate--validate--revise process, with four specialized validator agents that target solvability, realism, readability, and authenticity, respectively. We evaluate the framework on 600 problems drawn from ASSISTments, a popular online mathematics homework platform, personalizing each problem to a fixed set of 20 student interest topics. We compare three refinement strategies that differ in how validation feedback is coordinated into revisions. Results show that authenticity and realism are the most frequent failure modes in initial LLM-personalized problems, but that a single refinement iteration substantially reduces these failures. We further find that the three strategies differ in which criteria they handle best. We also assess validator reliability via human evaluation: reliability is highest on realism and lowest on authenticity, highlighting the need for better evaluation protocols that account for teachers' and students' personal characteristics.
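To make the generate--validate--revise process concrete, the following is a minimal Python sketch of one way such a loop could be organized. It is not the authors' implementation: every name here (`Verdict`, `personalize`, the `toy_*` functions) is hypothetical, the validators are trivial string checks standing in for LLM-based agents, and the way feedback from failed criteria is passed to the reviser is an assumption. The three refinement strategies compared in the paper are not reproduced.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# The four criteria named in the abstract.
CRITERIA = ("solvability", "realism", "readability", "authenticity")


@dataclass
class Verdict:
    """Outcome of one validator on one draft (hypothetical structure)."""
    passed: bool
    feedback: str = ""


Validator = Callable[[str], Verdict]


def personalize(
    problem: str,
    topic: str,
    generate: Callable[[str, str], str],
    revise: Callable[[str, Dict[str, str]], str],
    validators: Dict[str, Validator],
    max_iters: int = 1,
) -> Tuple[str, List[Dict[str, Verdict]]]:
    """Generate a draft, validate it, and revise up to max_iters times."""
    draft = generate(problem, topic)
    history: List[Dict[str, Verdict]] = []
    for iteration in range(max_iters + 1):
        # Run every validator on the current draft.
        verdicts = {name: validators[name](draft) for name in CRITERIA}
        history.append(verdicts)
        # Collect feedback only from the criteria that failed.
        failures = {n: v.feedback for n, v in verdicts.items() if not v.passed}
        if not failures or iteration == max_iters:
            break
        # Assumption: all failure feedback is handed to the reviser at once.
        draft = revise(draft, failures)
    return draft, history


# --- Toy stand-ins for the LLM-backed agents (illustration only) ---

def toy_generate(problem: str, topic: str) -> str:
    # Deliberately produces an unrealistic quantity.
    return f"In a {topic} match, Ana scores 5000 goals. {problem}"


def toy_revise(draft: str, failures: Dict[str, str]) -> str:
    # Fixes the one issue the toy realism validator can detect.
    if "realism" in failures:
        return draft.replace("5000", "5")
    return draft


def always_pass(draft: str) -> Verdict:
    return Verdict(passed=True)


def toy_realism(draft: str) -> Verdict:
    if "5000" in draft:
        return Verdict(False, "5000 goals in one match is unrealistic.")
    return Verdict(True)


TOY_VALIDATORS: Dict[str, Validator] = {
    "solvability": always_pass,
    "realism": toy_realism,
    "readability": always_pass,
    "authenticity": always_pass,
}
```

Calling `personalize("How many goals per half?", "soccer", toy_generate, toy_revise, TOY_VALIDATORS)` runs one validation pass, one revision, and a second validation pass. The default `max_iters=1` mirrors the single refinement iteration discussed in the abstract; a real system would replace the toy functions with model calls.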