PPTArena: A Benchmark for Agentic PowerPoint Editing

We introduce PPTArena, a benchmark for PowerPoint editing that measures reliable modifications to real slides under natural-language instructions. In contrast to image-PDF renderings or text-to-slide generation, PPTArena focuses on in-place editing across 100 decks, 2125 slides, and over 800 targeted edits covering text, charts, tables, animations, and master-level styles. Each case includes a ground-truth deck, a fully specified target outcome, and a dual VLM-as-judge pipeline that separately scores instruction following and visual quality using both structural diffs and slide images. Building on this setting, we propose PPTPilot, a structure-aware slide-editing agent that plans semantic edit sequences, routes between high-level programmatic tools and deterministic XML operations for precise control, and verifies outputs through an iterative plan-edit-check loop against task-specific constraints. In our experiments, PPTPilot outperforms strong proprietary agents and frontier VLM systems by over 10 percentage points on compound, layout-sensitive, and cross-slide edits, with particularly large gains in visual fidelity and deck-wide consistency. Despite these improvements, existing agents still underperform on long-horizon, document-scale tasks in PPTArena, highlighting the remaining challenges in reliable PPT editing.
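The plan-edit-check loop and the routing between programmatic tools and XML operations can be illustrated with a toy sketch. This is not PPTPilot's implementation: the deck is reduced to a dictionary of slide titles, and every name here (`Edit`, `plan`, `apply_edit`, `check`, `plan_edit_check`) as well as the routing rule are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Toy stand-in for a deck: slide index -> title text.
# A real deck is an OOXML package; this is only an illustration.
Deck = Dict[int, str]


@dataclass
class Edit:
    slide: int
    new_title: str
    use_xml: bool  # hypothetical routing decision made at planning time


def plan(deck: Deck, constraints: Deck) -> List[Edit]:
    """Plan one edit per constraint that the deck does not yet satisfy."""
    return [
        # Assumed routing rule: slide 0 plays the role of a master-level
        # element and is sent down the "xml" path; the rest use the "tool" path.
        Edit(slide, title, use_xml=(slide == 0))
        for slide, title in sorted(constraints.items())
        if deck.get(slide) != title
    ]


def apply_edit(deck: Deck, edit: Edit, log: List[Tuple[str, int]]) -> None:
    """Apply an edit and record which route handled it."""
    route = "xml" if edit.use_xml else "tool"
    deck[edit.slide] = edit.new_title  # both toy routes perform the same update
    log.append((route, edit.slide))


def check(deck: Deck, constraints: Deck) -> bool:
    """Verify the deck against the task-specific constraints."""
    return all(deck.get(slide) == title for slide, title in constraints.items())


def plan_edit_check(deck: Deck, constraints: Deck, max_rounds: int = 3):
    """Iterate plan -> edit -> check until the constraints hold or rounds run out."""
    log: List[Tuple[str, int]] = []
    for _ in range(max_rounds):
        for edit in plan(deck, constraints):
            apply_edit(deck, edit, log)
        if check(deck, constraints):
            return True, log
    return False, log


deck = {0: "Master", 1: "Intro", 2: "Results"}
ok, log = plan_edit_check(deck, {0: "Brand Master", 2: "Findings"})
print(ok, log, deck)
```

In this sketch the check step re-evaluates the same constraints that drove planning, so an edit that failed to take effect would be re-planned in the next round; a real agent would presumably verify against rendered slides and structural diffs as the abstract describes.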

