AeSlides: Incentivizing Aesthetic Layout in LLM-Based Slide Generation via Verifiable Rewards

Large language models (LLMs) have demonstrated strong potential in agentic tasks, particularly in slide generation. However, slide generation poses a fundamental challenge: the generation process is text-centric, whereas slide quality is governed by visual aesthetics. Because of this modality gap, current models frequently produce slides with aesthetically suboptimal layouts. Existing solutions typically rely either on heavy visual reflection, which incurs high inference cost yet yields limited gains, or on fine-tuning with large-scale datasets, which still provides only weak and indirect aesthetic supervision. The explicit use of aesthetic principles as supervision, by contrast, remains unexplored. In this work, we present AeSlides, a reinforcement learning framework with verifiable rewards for Aesthetic layout supervision in Slide generation. We introduce a suite of verifiable metrics that quantify slide layout quality and capture key layout issues accurately, efficiently, and at low cost. Leveraging these metrics, we develop a GRPO-based reinforcement learning method that directly optimizes slide generation models for aesthetically coherent layouts. Trained on GLM-4.7-Flash with only 5K prompts, AeSlides improves aspect ratio compliance from 36% to 85%, while reducing whitespace by 44%, element collisions by 43%, and visual imbalance by 28%. Human evaluation further shows an improvement in overall quality, with scores rising from 3.31 to 3.56 (+7.6%). AeSlides outperforms both model-based reward optimization and reflection-based agentic approaches, and even edges out Claude-Sonnet-4.5. These results demonstrate that a verifiable aesthetic paradigm provides an efficient and scalable approach to aligning slide generation with human aesthetic preferences. Our repository is available at https://github.com/ympan0508/aeslides.
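
To make the idea of verifiable layout rewards more concrete, the following is a minimal sketch, not the AeSlides implementation. It assumes slide elements are available as axis-aligned bounding boxes, scores a layout with two illustrative checks (aspect ratio compliance and pairwise element collisions), and then normalizes rewards within a group of sampled slides, as GRPO-style methods commonly do. All names (`Box`, `collision_ratio`, `layout_reward`, `group_relative_advantages`), the 16:9 target, and the tolerance are hypothetical choices; the actual metrics are defined in the paper and repository.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of one slide element (hypothetical type)."""
    x: float
    y: float
    w: float
    h: float


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes, or 0.0 if they do not overlap."""
    dx = min(a.x + a.w, b.x + b.w) - max(a.x, b.x)
    dy = min(a.y + a.h, b.y + b.h) - max(a.y, b.y)
    return dx * dy if dx > 0 and dy > 0 else 0.0


def collision_ratio(boxes: list[Box]) -> float:
    """Fraction of element pairs whose boxes overlap (assumed metric)."""
    pairs = [(a, b) for i, a in enumerate(boxes) for b in boxes[i + 1:]]
    if not pairs:
        return 0.0
    return sum(overlap_area(a, b) > 0 for a, b in pairs) / len(pairs)


def aspect_ratio_ok(width: float, height: float,
                    target: float = 16 / 9, tol: float = 0.02) -> bool:
    """True if the rendered slide is within a relative tolerance of the target."""
    return abs(width / height - target) <= tol * target


def layout_reward(boxes: list[Box], width: float, height: float) -> float:
    """Illustrative scalar reward: +1 for a compliant aspect ratio,
    minus the fraction of colliding element pairs."""
    compliance = 1.0 if aspect_ratio_ok(width, height) else 0.0
    return compliance - collision_ratio(boxes)


def group_relative_advantages(rewards: list[float],
                              eps: float = 1e-6) -> list[float]:
    """Normalize each reward by the mean and (population) standard deviation
    of its group. Whether population or sample std is used is an assumption."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two sampled layouts for the same prompt: one clean, one with a collision.
clean = [Box(0, 0, 100, 50), Box(0, 60, 100, 50)]
colliding = [Box(0, 0, 100, 50), Box(50, 25, 100, 50)]
rewards = [layout_reward(clean, 1280, 720), layout_reward(colliding, 1280, 720)]
advantages = group_relative_advantages(rewards)
```

Because every term is computed from geometry rather than from a learned judge, the reward is cheap to evaluate and deterministic, which is the property the abstract attributes to verifiable metrics. In this toy group the clean layout receives a positive advantage and the colliding one a negative advantage.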