PhysicsSolutionAgent: Towards Multimodal Explanations for Numerical Physics Problem Solving

Explaining numerical physics problems often requires more than text-based solutions; clear visual reasoning can substantially improve conceptual understanding. While large language models (LLMs) demonstrate strong performance on many physics questions in textual form, their ability to generate long, high-quality visual explanations remains insufficiently explored. In this work, we introduce PhysicsSolutionAgent (PSA), an autonomous agent that generates physics-problem explanation videos of up to six minutes using Manim animations. To evaluate the generated videos, we design an assessment pipeline that performs automated checks across 15 quantitative parameters and incorporates feedback from a vision-language model (VLM) to iteratively improve video quality. We evaluate PSA on 32 videos spanning numerical and theoretical physics problems. Our results reveal systematic differences in video quality depending on problem difficulty and whether the task is numerical or theoretical. Using GPT-5-mini, PSA achieves a 100% video-completion rate with an average automated score of 3.8/5. However, qualitative analysis and human inspection uncover both minor and major issues, including visual layout inconsistencies and errors in how visual content is interpreted during feedback. These findings expose key limitations in reliable Manim code generation and highlight broader challenges in multimodal reasoning and evaluation for visual explanations of numerical physics problems. Our work underscores the need for improved visual understanding, verification, and evaluation frameworks in future multimodal educational systems.
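
The generate, evaluate, and refine cycle described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: `refine_video`, `Evaluation`, the `generate` and `evaluate` callables, the `target_score` threshold, and the `max_rounds` limit are all hypothetical names and values chosen for illustration, and the abstract does not specify how the 15 automated parameters and the VLM feedback are combined.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Evaluation:
    """Hypothetical result of scoring one generated animation script."""
    score: float    # aggregate quality score, assumed here to be on a 0-5 scale
    feedback: str   # free-text critique, e.g. produced by a VLM


def refine_video(
    problem: str,
    generate: Callable[[str, str], str],
    evaluate: Callable[[str], Evaluation],
    target_score: float = 4.0,   # assumed stopping threshold
    max_rounds: int = 3,         # assumed iteration budget
) -> Tuple[str, Evaluation]:
    """Generate a script, score it, and regenerate using the feedback.

    Returns the best-scoring script seen and its evaluation.
    """
    feedback = ""
    best_code = ""
    best_eval = Evaluation(score=float("-inf"), feedback="")
    for _ in range(max_rounds):
        code = generate(problem, feedback)   # e.g. an LLM call producing Manim code
        result = evaluate(code)              # e.g. automated checks plus a VLM critique
        if result.score > best_eval.score:
            best_code, best_eval = code, result
        if result.score >= target_score:
            break
        feedback = result.feedback           # feed the critique into the next round
    return best_code, best_eval
```

Keeping the best-scoring attempt rather than the last one is a design choice of this sketch: a later revision can score lower than an earlier one, so the loop never returns a regression.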
