Despite advances in Vision-Language-Action (VLA) models, robotic manipulation still struggles with fine-grained tasks because current models lack mechanisms for actively allocating visual attention. Human gaze naturally encodes intent, planning, and execution patterns, offering a powerful supervisory signal for guiding robot perception. We introduce a gaze-regularized training framework that aligns a VLA model's internal attention with human visual patterns without architectural modifications or inference-time overhead. Our method transforms temporally aggregated gaze heatmaps into patch-level distributions and regularizes the transformer's attention toward them with a KL-divergence term, creating an inductive bias toward task-relevant features while preserving deployment efficiency. When integrated into existing VLA architectures, our approach yields 4-12% improvements across manipulation benchmarks. The gaze-regularized models reach equivalent performance in fewer training steps and remain robust under lighting variations and sensor noise. Beyond performance metrics, the learned attention patterns produce interpretable visualizations that mirror human strategies, enhancing trust in robotic systems. Moreover, our framework requires no eye-tracking equipment and applies directly to existing datasets. These results demonstrate that human perceptual priors can significantly accelerate robot learning while improving both task performance and system interpretability.
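The core mechanism described above (pooling a gaze heatmap into a patch-level distribution, then penalizing the KL divergence between that distribution and the model's attention) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the sum-pooling scheme, the uniform fallback for empty heatmaps, the KL direction (gaze target relative to attention), and the weight `0.1` are all assumptions made for illustration, and the heatmap is assumed to be already aggregated over time. A real training pipeline would use a tensor library rather than plain Python lists.

```python
import math


def heatmap_to_patch_distribution(heatmap, patch_size, eps=1e-8):
    """Sum-pool an H x W gaze heatmap into patches and normalize to sum to 1.

    Assumption: the heatmap is already temporally aggregated and its
    dimensions are divisible by patch_size.
    """
    height, width = len(heatmap), len(heatmap[0])
    if height % patch_size or width % patch_size:
        raise ValueError("heatmap dimensions must be divisible by patch_size")
    cols = width // patch_size
    mass = [0.0] * ((height // patch_size) * cols)
    for y in range(height):
        for x in range(width):
            mass[(y // patch_size) * cols + (x // patch_size)] += heatmap[y][x]
    total = sum(mass)
    if total <= 0:
        # No gaze signal: fall back to a uniform target (illustrative choice).
        return [1.0 / len(mass)] * len(mass)
    # Smooth with eps so no patch has exactly zero probability, then renormalize.
    smoothed = [m / total + eps for m in mass]
    norm = sum(smoothed)
    return [s / norm for s in smoothed]


def kl_divergence(target, predicted, eps=1e-8):
    """KL(target || predicted) between two discrete distributions over patches."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, predicted)
        if t > 0
    )


def gaze_regularized_loss(task_loss, attention, gaze_target, weight=0.1):
    """Task loss plus a weighted KL penalty; weight is a hypothetical hyperparameter."""
    return task_loss + weight * kl_divergence(gaze_target, attention)


# Toy example: a 4 x 4 heatmap with all gaze mass in the top-left 2 x 2 patch.
heatmap = [
    [0, 0, 0, 0],
    [0, 4, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
gaze_target = heatmap_to_patch_distribution(heatmap, patch_size=2)
uniform_attention = [0.25, 0.25, 0.25, 0.25]
penalty_uniform = kl_divergence(gaze_target, uniform_attention)
penalty_aligned = kl_divergence(gaze_target, gaze_target)
total_loss = gaze_regularized_loss(1.0, uniform_attention, gaze_target)
```

In this toy case, attention that matches the gaze target incurs no penalty, while uniform attention is penalized. Because the term enters only the training loss, nothing changes at inference time, which is consistent with the abstract's claim of no inference-time overhead.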