We present ANVIL, a multimodal generative system that automates the production of analogy-based instructional animations for computer science topics. Given a concept definition, ANVIL generates a textual analogy, compiles it into a structured visual screenplay, and produces executable Manim code that renders the animation; an automated repair mechanism improves robustness when the generated code fails. Evaluating such systems requires balancing pedagogical validity against scalability. We begin with a teacher evaluation to ground the quality assessment and use its findings to guide automated screening. For textual analogies, we introduce an LLM-based evaluator for scalable quality screening. For videos, where subjective judgments are difficult to automate, we instead use an automated proxy that measures fidelity to the intended screenplay and supports auditing and error analysis. We further conduct a user study with educators to examine adoption requirements and risks. Our findings suggest that ANVIL can produce materials that are frequently rated as adequate, and that educators respond positively to its perceived value and usability.
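The generate-then-repair step described above can be illustrated with a minimal sketch. It is not ANVIL's implementation: the names `RenderResult`, `generate_with_repair`, `syntax_check`, and `toy_repair` are hypothetical, the render step is replaced by a plain Python syntax check rather than an actual Manim run, and the repair step is a toy stand-in for what would presumably be an LLM call that receives the failing code and its error message.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class RenderResult:
    """Outcome of one render attempt (hypothetical structure)."""
    ok: bool
    error: Optional[str] = None


def generate_with_repair(
    code: str,
    render: Callable[[str], RenderResult],
    repair: Callable[[str, str], str],
    max_attempts: int = 3,
) -> Tuple[str, bool, int]:
    """Try to render `code`; on failure, ask `repair` for a fix and retry.

    Returns (final_code, succeeded, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        result = render(code)
        if result.ok:
            return code, True, attempt
        # Only request a repair if another attempt remains.
        if attempt < max_attempts:
            code = repair(code, result.error or "")
    return code, False, max_attempts


def syntax_check(code: str) -> RenderResult:
    """Stand-in for rendering: only checks that the code parses."""
    try:
        compile(code, "<scene>", "exec")
        return RenderResult(ok=True)
    except SyntaxError as exc:
        return RenderResult(ok=False, error=str(exc))


def toy_repair(code: str, error: str) -> str:
    """Toy stand-in for an LLM repair call: closes one parenthesis."""
    return code + ")"


# A deliberately broken snippet with one unclosed parenthesis.
broken = "print('a stack is like a pile of plates'"
final_code, succeeded, attempts = generate_with_repair(
    broken, syntax_check, toy_repair
)
```

Bounding the loop with `max_attempts` keeps a persistently failing generation from retrying indefinitely; the caller can then log the final error for the kind of auditing and error analysis the abstract mentions.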