Generating programmatic animations with libraries such as Manim poses distinct challenges for Large Language Models (LLMs): it requires spatial reasoning, temporal sequencing, and familiarity with domain-specific APIs that are underrepresented in general pre-training data. A systematic study of how training and inference strategies interact in this setting is still lacking. This study introduces ManimTrainer, a training pipeline that combines Supervised Fine-tuning (SFT) with Group Relative Policy Optimisation (GRPO), a Reinforcement Learning (RL) method, using a unified reward that fuses code and visual assessment signals, and ManimAgent, an inference pipeline featuring Renderer-in-the-loop (RITL) and API documentation-augmented RITL (RITL-DOC) strategies. With these, it presents the first unified training and inference study for text-to-code-to-video transformation with Manim, evaluating 17 open-source sub-30B LLMs across nine combinations of training and inference strategies on ManimBench. Results show that SFT generally improves code quality, while GRPO enhances visual outputs and makes models more responsive to extrinsic signals during self-correction at inference time. The Qwen 3 Coder 30B model with GRPO and RITL-DOC achieved the highest overall performance, with a 94% Render Success Rate (RSR) and 85.7% Visual Similarity (VS) to reference videos, surpassing the baseline GPT-4.1 model by 3 percentage points in VS. The analysis also shows that the correlation between code and visual metrics strengthens with SFT and GRPO but weakens with inference-time enhancements, highlighting the complementary roles of training and agentic inference strategies in Manim animation generation.
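To make the idea of a unified reward feeding GRPO more concrete, the following is a minimal sketch and not the pipeline's actual implementation. The abstract does not give the reward formula, so the weighted blend, the weight `w_code`, the rule that a failed render earns no visual credit, and both function names are assumptions made for illustration. The second function uses the commonly described GRPO normalisation, in which each sampled completion's reward is standardised against the other completions drawn for the same prompt.

```python
from statistics import mean, pstdev


def fused_reward(code_score: float, visual_score: float,
                 rendered: bool, w_code: float = 0.5) -> float:
    """Blend a code-quality score and a visual score, both assumed in [0, 1].

    Hypothetical rule: if the generated script failed to render, there is
    no video to compare, so the visual part contributes nothing.
    """
    if not 0.0 <= w_code <= 1.0:
        raise ValueError("w_code must lie in [0, 1]")
    visual = visual_score if rendered else 0.0
    return w_code * code_score + (1.0 - w_code) * visual


def group_relative_advantages(rewards: list[float],
                              eps: float = 1e-8) -> list[float]:
    """Standardise rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps); eps guards
    against division by zero when all samples score the same.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical completions for one prompt:
# (code score, visual score, rendered successfully?)
samples = [(0.9, 0.8, True), (0.7, 0.0, False), (0.6, 0.5, True), (0.8, 0.9, True)]
rewards = [fused_reward(c, v, ok) for c, v, ok in samples]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, completions that both render and resemble the reference receive positive advantages, while the one that fails to render falls below the group mean and is pushed down, which is one plausible way a single signal could reward code and visual quality together.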