Learning to Present: Inverse Specification Rewards for Agentic Slide Generation

Source: arXiv. 12 pages, 11 figures, 13 tables, 26 references.
Code: https://github.com/pushing-the-frontier/slide-forge-llm
Dataset: https://huggingface.co/datasets/KarthikRagunathAnandaKumar/sliderl-multi-turn-rollouts

Automated presentation generation remains a challenging task requiring coherent content creation, visual design, and audience-aware communication. This work proposes an OpenEnv-compatible reinforcement learning environment where LLM agents learn to research topics, plan content, and generate professional HTML slide presentations through tool use. We introduce a multi-component reward system combining structural validation, render quality assessment, LLM-based aesthetic scoring, content quality metrics, and an inverse specification reward that measures how faithfully generated slides convey their intended purpose. The inverse specification reward, an "inverse task" where an LLM attempts to recover the original specification from generated slides, provides a holistic quality signal. Our approach fine-tunes Qwen2.5-Coder-7B via GRPO, training only 0.5% of parameters on prompts derived from expert demonstrations collected using Claude Opus 4.6. Experiments on 48 diverse business briefs across six models demonstrate that our fine-tuned 7B model achieves 91.2% of Claude Opus 4.6's quality while improving 33.1% over the base model. The six-model comparison reveals that instruction adherence and tool-use compliance, rather than raw parameter count, determine agentic task performance. We contribute SlideRL, an open-source dataset of 288 multi-turn rollout trajectories across all six models: https://huggingface.co/datasets/KarthikRagunathAnandaKumar/sliderl-multi-turn-rollouts Code: https://github.com/pushing-the-frontier/slide-forge-llm
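To make the reward design concrete, the following is a minimal sketch of how an inverse specification reward could be folded into a multi-component score and then turned into GRPO-style group-relative advantages. It is an illustration under stated assumptions, not the authors' implementation: the function names, the equal weights, and the word-overlap similarity are all hypothetical, and the `recover` callable stands in for the LLM that attempts to reconstruct the specification from the generated slides.

```python
from statistics import mean, pstdev
from typing import Callable, Dict, List


def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets.

    Assumption: a crude stand-in for whatever comparison the paper uses.
    """
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def inverse_spec_reward(
    spec: Dict[str, str],
    slides_html: str,
    recover: Callable[[str], Dict[str, str]],
) -> float:
    """Recover a spec from the slides, then score agreement field by field.

    `recover` is a hypothetical hook; in the paper this role is played by an LLM.
    """
    recovered = recover(slides_html)
    scores = [token_jaccard(value, recovered.get(key, "")) for key, value in spec.items()]
    return mean(scores) if scores else 0.0


# Hypothetical equal weights; the abstract names the components but not their weights.
WEIGHTS = {
    "structure": 0.2,
    "render": 0.2,
    "aesthetic": 0.2,
    "content": 0.2,
    "inverse_spec": 0.2,
}


def combined_reward(components: Dict[str, float]) -> float:
    """Weighted sum of per-component scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style advantage: each rollout's reward standardized within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)  # identical rewards carry no learning signal
    return [(r - mu) / sigma for r in rewards]
```

In this sketch, a deck from which the specification can be fully recovered scores 1.0 on the inverse component, while a deck that reveals nothing about its brief scores 0.0. Practical GRPO implementations typically add a small epsilon to the standard deviation rather than special-casing zero.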
