Inverse Dynamics Models (IDMs) map visual observations to low-level action commands, serving as central components for data labeling and policy execution in embodied AI. However, their performance degrades severely under manipulator truncation, a common failure mode that makes state recovery ill-posed and leads to unstable control. We present StableIDM, a spatio-temporal framework that refines features from visual inputs to stabilize action predictions under such partial observability. StableIDM integrates three complementary components: (1) auxiliary robot-centric masking, which suppresses background clutter; (2) Directional Feature Aggregation (DFA), which performs geometry-aware spatial reasoning by extracting anisotropic features along directions inferred from the visible arm; and (3) Temporal Dynamics Refinement (TDR), which smooths and corrects predictions using motion continuity. Extensive evaluations validate our approach: StableIDM improves strict action accuracy by 12.1% under severe truncation on the AgiBot benchmark and increases average task success by 9.7% in real-robot replay. Moreover, it boosts end-to-end grasp success by 11.5% when decoding video-generated plans and improves downstream VLA real-robot success by 17.6% when used as an automatic annotator. These results demonstrate that StableIDM provides a robust and scalable backbone for both policy execution and data generation in embodied AI.