DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation

Real-world household robots require Vision-Language-Action (VLA) foundation models that can acquire reusable manipulation skills across diverse objects, task conditions, and household environments. Deformable-object folding is a representative challenge: the robot must handle clothing items from random initial states across varying categories, geometries, materials, and scenes. However, existing VLA systems commonly train separate policies for different object categories, while naively mixed multi-task training often suffers from task interference and degraded performance. To move beyond category-specific folding policies, we introduce DeMaVLA, a VLA foundation model for generalizable Deformable Manipulation. DeMaVLA pairs a Vision-Language Model (VLM) backbone with an action expert and generates continuous actions via flow matching. To improve efficiency, the action expert is constructed by pruning every other transformer layer while preserving layer-wise alignment with the VLM backbone, which reduces both training and inference cost. DeMaVLA is first pre-trained on approximately 5,000 hours of selected real-world dual-arm demonstrations to acquire general manipulation priors. It is then post-trained on mixed folding data collected through a human-in-the-loop Dataset Aggregation (DAgger) pipeline, which aggregates self-collected demonstrations and corrective trajectories from real-robot failures across multiple folding tasks. Experiments show that DeMaVLA achieves competitive performance on RoboTwin 2.0 and strong real-world results on our household folding benchmark. These results highlight the value of scalable real-world data, efficient action generation, and corrective learning for general-purpose VLA policies in deformable-object manipulation.
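
As a rough illustration of what generating continuous actions with flow matching can look like, the sketch below uses a common straight-path formulation: a noisy sample is interpolated toward a target action, a model regresses the path velocity, and inference integrates the learned velocity field with Euler steps. This is a minimal sketch under stated assumptions, not DeMaVLA's implementation: the time convention (noise at t = 0, action at t = 1), the Gaussian noise, the step count, and all function names (`interpolate`, `target_velocity`, `flow_matching_loss`, `sample_action`, `predict_velocity`) are hypothetical and are not taken from the paper.

```python
import random


def interpolate(noise, action, t):
    """Point on the straight path from noise (t=0) to action (t=1). Assumed convention."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]


def target_velocity(noise, action):
    """Velocity of the straight path; it does not depend on t."""
    return [a - n for n, a in zip(noise, action)]


def flow_matching_loss(predict_velocity, action, rng):
    """Single-sample mean squared error between predicted and target velocity.

    predict_velocity(x_t, t) stands in for the action expert (hypothetical interface).
    """
    noise = [rng.gauss(0.0, 1.0) for _ in action]  # assumed Gaussian prior
    t = rng.random()                               # t drawn uniformly from [0, 1)
    x_t = interpolate(noise, action, t)
    pred = predict_velocity(x_t, t)
    target = target_velocity(noise, action)
    return sum((p - u) ** 2 for p, u in zip(pred, target)) / len(action)


def sample_action(predict_velocity, noise, num_steps=10):
    """Euler integration of dx/dt = v(x, t) from t=0 to t=1, starting at noise."""
    x = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = predict_velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In a real system `predict_velocity` would be a network conditioned on images, language, and robot state, and the loss would be averaged over a batch of action chunks. For a single fixed target action `a`, the function `v(x, t) = (a - x) / (1 - t)` matches the target velocity on the path, which makes it a convenient way to sanity-check both the loss and the sampler.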
