Perfect Demo Makes Poor Teacher: Learning Robust Alignment from Critical Motion Segments

Expert demonstrations are widely assumed to be the gold standard for robot imitation learning. Yet for fine-grained manipulation such as insertion, stacking, and alignment, we uncover a counterintuitive failure mode: fluent demonstrations can be poor teachers. A skilled teleoperator compresses the decisive moments of alignment and recovery into a brief temporal window, leaving the policy flooded with redundant free-space motion and starved of supervision exactly where precision determines success. We address this bottleneck at two levels. At the data level, slowing down near alignment and resampling critical segments both help, yet the gain comes mainly from broadening the coverage of recovery states the policy must learn, not from reweighting frames it already has. Such data-side fixes, however, leave the policy's per-frame view untouched: a single image still maps directly to an action, and the local motion that governs correction stays implicit. We therefore turn to the representation level and introduce STAIR (\textbf{S}patio-\textbf{T}emporal feature \textbf{A}s an \textbf{I}nterface for \textbf{R}obot learning), a compact dynamic feature that bridges the vision-language model and the action expert, distilling the short-horizon motion already recorded in each trajectory into dense, motion-aware supervision. Trained on fluent data alone, STAIR recovers most of the deliberate-demonstration gain (from $50.0\%$ to $62.2\%$ overall, approaching the $64.4\%$ of deliberate demonstrations). These results call for a more pedagogical view of robot data, optimized for machine learnability rather than human efficiency alone.
