Parallels Between VLA Model Post-Training and Human Motor Learning: Progress, Challenges, and Trends

Tian-Yu Xiang, Ao-Qun Jin, Xiao-Hu Zhou, Mei-Jiang Gui, Xiao-Liang Xie, Shi-Qi Liu, Shuang-Yi Wang, Sheng-Bin Duan, Fu-Chao Xie, Wen-Kai Wang, Si-Cheng Wang, Ling-Yun Li, Tian Tu, Zeng-Guang Hou

Vision-language-action (VLA) models extend vision-language models (VLMs) by integrating action generation modules for robotic manipulation. Leveraging the strengths of VLMs in visual perception and instruction understanding, VLA models exhibit promising generalization across diverse manipulation tasks. However, applications demanding high precision and accuracy reveal performance gaps without further adaptation. Evidence from multiple domains highlights the critical role of post-training in aligning foundational models with downstream applications, spurring extensive research on post-training VLA models. VLA model post-training aims to enhance an embodiment's ability to interact with the environment for the specified tasks. This perspective aligns with Newell's constraints-led theory of skill acquisition, which posits that motor behavior arises from interactions among task, environmental, and organismic (embodiment) constraints. Accordingly, this survey structures post-training methods into four categories: (i) enhancing environmental perception, (ii) improving embodiment awareness, (iii) deepening task comprehension, and (iv) multi-component integration. Experimental results on standard benchmarks are synthesized to distill actionable guidelines. Finally, open challenges and emerging trends are outlined, relating insights from human learning to prospective methods for VLA post-training. This work delivers both a comprehensive overview of current VLA model post-training methods from a human motor learning perspective and practical insights for VLA model development. Project website: https://github.com/AoqunJin/Awesome-VLA-Post-Training.
