Action-supervised fine-tuning of vision-language-action (VLA) policies fits demonstrations effectively but constrains only the directions that change predicted actions, leaving visual structure that is consistent across action-equivalent states free to collapse. We formalize this as residual visual collapse along local action fibers and propose FiberTune, a training-time objective that preserves teacher-structured visual residuals without adding inference-time overhead. FiberTune uses an online action probe to estimate action-predictive feature directions, filters them from intermediate visual-token representations, and aligns the resulting probe-filtered residuals to a frozen visual teacher while regularizing their effective rank. Under identical training conditions, FiberTune improves over task-loss-only fine-tuning in all six controlled simulation settings, which span two benchmarks and two architectures (pi_0.5 and OpenVLA-OFT), and on physical SO-101 pick-place. Representative gains include +10.7 percentage points in SR(5) on long-horizon CALVIN ABC-to-D and an increase in physical SO-101 task success from 72.7% to 78.1%. Residual diagnostics show that these gains coincide with higher teacher alignment and effective rank of the probe-filtered residuals, consistent with the action-fiber motivation.
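The probe-filter, align, and rank-regularize steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a linear action probe whose row space stands in for the action-predictive directions, teacher features with the same dimension as the policy features, a cosine-based alignment term, and one common definition of effective rank (the exponential of the entropy of the normalized singular values). All function names and the weights `lam_align` and `lam_rank` are hypothetical.

```python
import numpy as np


def probe_filter(tokens, probe_weight, tol=1e-8):
    """Project token features onto the orthogonal complement of the probe's row space.

    tokens:       (n, d) visual-token features
    probe_weight: (a, d) weights of an assumed linear action probe
    """
    # Orthonormal basis of the probe's row space from the SVD.
    _, s, vt = np.linalg.svd(probe_weight, full_matrices=False)
    basis = vt[s > tol * s.max()]  # (r, d), r = numerical rank of the probe
    # Subtract the component lying in the action-predictive subspace.
    return tokens - (tokens @ basis.T) @ basis


def effective_rank(x, eps=1e-12):
    """exp(entropy) of the normalized singular values of the centered features."""
    s = np.linalg.svd(x - x.mean(axis=0), compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]  # drop numerically zero singular values
    return float(np.exp(-(p * np.log(p)).sum()))


def alignment_loss(residual, teacher, eps=1e-8):
    """One minus the mean per-token cosine similarity (assumed alignment measure)."""
    r = residual / (np.linalg.norm(residual, axis=1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=1, keepdims=True) + eps)
    return float(1.0 - (r * t).sum(axis=1).mean())


def fibertune_style_loss(task_loss, tokens, probe_weight, teacher,
                         lam_align=1.0, lam_rank=0.01):
    """Hypothetical combined objective: task loss + alignment - effective-rank bonus."""
    residual = probe_filter(tokens, probe_weight)
    return (task_loss
            + lam_align * alignment_loss(residual, teacher)
            - lam_rank * effective_rank(residual))
```

In this sketch the filtered residual carries no information a linear probe with the same weights could read out, so the alignment and rank terms act only on directions the action loss leaves unconstrained. Because all of these terms are computed during training, nothing in the sketch would need to run at inference time.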