Vision-Language-Action (VLA) models have demonstrated remarkable zero-shot generalization in robotic manipulation, yet the vast majority of pre-trained pipelines remain confined to low-DoF parallel grippers. Adapting these rich semantic priors to high-DoF dexterous hands introduces a severe morphology gap: because demonstration data are scarce, direct end-to-end joint fine-tuning causes catastrophic forgetting of spatial reasoning and collapse of the action manifold. In this paper, we present InDex, a data-efficient adaptation framework rooted in cross-morphology semantic inheritance. Rather than discarding the pre-trained 1-DoF parallel grasp output, we repurpose it as a continuous, macroscopic virtual grasp intent that serves as a proxy for hand closure and lets us sequentialize the control topology. We implement a two-stage decoupled learning architecture. The first stage uses parameter-efficient fine-tuning to align the VLA backbone so that it predicts continuous arm trajectories and the scalar grasp intent. The second stage freezes this spatial backbone and uses an intent-conditioned denoising diffusion head to decode fine-grained joint articulations for multi-fingered end-effectors. Extensive simulation benchmarks across a suite of multi-stage, contact-rich dexterous manipulation tasks demonstrate that InDex masters intricate skills with minimal demonstration data, substantially outperforming monolithic baselines while preserving the robust spatial generalizability of the original VLA prior.
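The two-stage data flow described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: `stage1_backbone`, `toy_eps_model`, `sample_joints`, the 16-joint hand, the 50-step linear noise schedule, and the 6-DoF arm action are all hypothetical choices made for illustration. The frozen backbone is replaced by a stub that emits an arm action and a scalar grasp intent, and the trained diffusion head is replaced by an analytic noise predictor that pretends the clean joint target equals the intent value. Only the sampling loop follows the standard DDPM ancestral sampling update.

```python
import numpy as np

N_JOINTS = 16  # hypothetical hand DoF (assumption, not from the paper)
T = 50         # hypothetical number of diffusion steps

# Standard DDPM linear noise schedule.
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)


def stage1_backbone(observation):
    """Stub for the frozen stage-1 backbone: returns (arm_action, grasp_intent)."""
    arm_action = np.zeros(6)  # placeholder 6-DoF arm command
    # Scalar intent in [0, 1]; a real backbone would predict this value.
    grasp_intent = float(np.clip(observation.mean(), 0.0, 1.0))
    return arm_action, grasp_intent


def toy_eps_model(x_t, t, intent):
    """Placeholder for the trained intent-conditioned noise predictor.

    Assumes the clean joint target is `intent` on every joint and inverts
    the forward process x_t = sqrt(ab_t) * x_0 + sqrt(1 - ab_t) * eps.
    """
    x0 = np.full(N_JOINTS, intent)
    return (x_t - np.sqrt(alpha_bars[t]) * x0) / np.sqrt(1.0 - alpha_bars[t])


def sample_joints(intent, rng, eps_model=toy_eps_model):
    """DDPM ancestral sampling of joint targets, conditioned on the intent."""
    x = rng.standard_normal(N_JOINTS)  # start from pure Gaussian noise
    for t in reversed(range(T)):
        eps = eps_model(x, t, intent)
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # No noise is added on the final step.
        noise = rng.standard_normal(N_JOINTS) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    arm_action, intent = stage1_backbone(np.full((4, 4), 0.7))
    joints = sample_joints(intent, rng)
    print(arm_action.shape, intent, joints.shape)
```

The point of the sketch is the interface: stage 2 sees only the scalar intent (plus, in a real system, whatever features the frozen backbone exposes), so the finger-level decoder can be trained without updating the backbone. With the toy predictor, the sampled joints simply converge to the intent value, which is a sanity check on the sampling loop and not a claim about learned behavior.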