Vision-Language-Action (VLA) models show promise in embodied reasoning, yet they remain far from true generalists: they typically require task-specific fine-tuning, incur high compute costs, and generalize poorly to unseen tasks. We propose MetaVLA, a unified, backbone-agnostic post-training framework for efficient and scalable alignment. MetaVLA introduces Context-Aware Meta Co-Training, which consolidates diverse target tasks into a single fine-tuning stage while leveraging structurally diverse auxiliary tasks to improve in-domain generalization. Unlike naive multi-task SFT, MetaVLA integrates a lightweight meta-learning mechanism, derived from Attentive Neural Processes, that enables rapid adaptation from diverse contexts with minimal architectural change or inference overhead. On the LIBERO benchmark, MetaVLA with six auxiliary tasks outperforms OpenVLA by up to 8.0% on long-horizon tasks, reduces training steps from 240K to 75K, and cuts GPU time by ~76%. These results show that scalable, low-resource post-training is achievable, paving the way toward general-purpose embodied agents. Code will be made available.