Calibrating a robot simulator's physics parameters (friction, damping, material stiffness) to match real hardware is often done by hand or with black-box optimizers that reduce error but cannot explain which physical discrepancies drive it. When sensing is limited to external cameras, the problem is further compounded by perception noise and the absence of direct force or state measurements. We present Vid2Sid, a video-driven system-identification pipeline that couples foundation-model perception with a VLM-in-the-loop optimizer: at each iteration the optimizer analyzes paired sim-real videos, diagnoses concrete mismatches, and proposes physics-parameter updates with natural-language rationales. We evaluate the approach on a tendon-actuated finger (rigid-body dynamics in MuJoCo) and a deformable continuum tentacle (soft-body dynamics in PyElastica). On sim2real holdout controls unseen during training, Vid2Sid achieves the best average rank across all settings, matching or exceeding black-box optimizers while uniquely providing interpretable reasoning at each iteration. Sim2sim validation confirms that Vid2Sid recovers ground-truth parameters most accurately (mean relative error under 13\% vs. 28--98\% for baselines), and ablation analysis reveals three calibration regimes: VLM-guided optimization excels when perception is clean and the simulator is expressive, while model-class limitations bound performance in more challenging settings.
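To make the VLM-in-the-loop structure concrete, the sketch below shows one plausible shape of the calibration loop the abstract describes: render a simulated video under the current parameters, ask a VLM to diagnose the sim-real mismatch, and apply its proposed update. All names here (simulate_video, query_vlm, parse_update) are hypothetical placeholders for illustration, not the paper's actual API.

\begin{verbatim}
# Minimal sketch of a VLM-in-the-loop calibration loop, assuming the
# caller supplies a simulator rollout, a VLM query, and an update parser.
# These callables are hypothetical stand-ins, not Vid2Sid's real interface.
from dataclasses import dataclass

@dataclass
class CalibrationStep:
    params: dict      # physics parameters after this iteration
    rationale: str    # the VLM's natural-language diagnosis

def calibrate(real_video, init_params, simulate_video, query_vlm,
              parse_update, n_iters=10):
    """Iteratively refine simulator parameters against a real video."""
    params, history = dict(init_params), []
    for _ in range(n_iters):
        sim_video = simulate_video(params)   # roll out the simulator
        # The VLM compares the paired videos and returns a textual
        # diagnosis plus proposed parameter deltas.
        reply = query_vlm(real_video, sim_video, params)
        rationale, deltas = parse_update(reply)
        params = {k: v + deltas.get(k, 0.0) for k, v in params.items()}
        history.append(CalibrationStep(dict(params), rationale))
    return params, history
\end{verbatim}

The returned history pairs each parameter update with its rationale, which is what distinguishes this loop from a black-box optimizer: every step is accompanied by an explanation of which physical discrepancy it targets.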