MeepleLM: A Virtual Playtester Simulating Diverse Subjective Experiences

Recent advancements have expanded the role of Large Language Models in board games from playing agents to creative co-designers. However, a critical gap remains: current systems lack the capacity to offer constructive critique grounded in the emergent user experience. Bridging this gap is fundamental for harmonizing Human-AI collaboration, as it empowers designers to refine their creations via external perspectives while steering models away from biased or unpredictable outcomes. Automating critique for board games presents two challenges: inferring the latent dynamics connecting rules to gameplay without an explicit engine, and modeling the subjective heterogeneity of diverse player groups. To address these, we curate a dataset of 1,727 structurally corrected rulebooks and 150K reviews selected via quality scoring and facet-aware sampling. We augment this data with Mechanics-Dynamics-Aesthetics (MDA) reasoning to explicitly bridge the causal gap between written rules and player experience. We further distill player personas and introduce MeepleLM, a specialized model that internalizes persona-specific reasoning patterns to accurately simulate the subjective feedback of diverse player archetypes. Experiments demonstrate that MeepleLM significantly outperforms the latest commercial models (e.g., GPT-5.1, Gemini3-Pro) in community alignment and critique quality, achieving a 70% preference rate in user studies assessing utility. MeepleLM serves as a reliable virtual playtester for general interactive systems, marking a pivotal step toward audience-aligned, experience-aware Human-AI collaboration.
