The strong generalization of Vision-Language-Action (VLA) models is bottlenecked by their heavy reliance on massive datasets that are redundant and of uneven value, hindering widespread deployment. Existing model-centric optimization paths, such as model compression (which often degrades performance) or policy distillation (whose products are model-dependent and lack generality), fail to address this challenge at the data level. To this end, this paper introduces FT-NCFM, a fundamentally different, data-centric generative data distillation framework. The framework employs a self-contained Fact-Tracing (FT) engine that combines causal attribution with programmatic contrastive verification to assess the intrinsic value of individual samples. Guided by these assessments, an adversarial NCFM process synthesizes a model-agnostic, information-dense, and reusable data asset. Experiments on several mainstream VLA benchmarks show that models trained on a distilled coreset containing just 5% of the data reach 85-90% of the success rate achieved by full-dataset training, while reducing training time by over 80%. Our work demonstrates that intelligent data distillation is a highly promising new path toward efficient, high-performance VLA models.
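To make the adversarial distillation step concrete, the sketch below illustrates characteristic-function matching in the spirit of NCFM-style dataset distillation: synthetic samples are optimized to minimize the gap between the empirical characteristic functions of real and synthetic features, while an adversarial frequency sampler seeks the frequencies where that gap is largest. This is a minimal illustration under our own assumptions, not the paper's actual FT-NCFM objective; all names here (`FreqNet`, `cf_discrepancy`, the random stand-in features) are ours.

```python
# Minimal sketch of NCFM-style adversarial characteristic-function matching.
# Hypothetical illustration only: the real FT-NCFM pipeline would operate on
# FT-weighted VLA trajectory features, not random stand-in tensors.
import torch
import torch.nn as nn


class FreqNet(nn.Module):
    """Adversarially learned frequency vectors t at which the empirical
    characteristic functions of real and synthetic features are compared."""

    def __init__(self, dim, n_freq=64):
        super().__init__()
        self.t = nn.Parameter(torch.randn(n_freq, dim))

    def forward(self):
        return self.t


def empirical_cf(feats, t):
    # feats: (N, d), t: (m, d) -> real and imaginary parts of the empirical
    # characteristic function E[exp(i * t^T x)], each of shape (m,).
    proj = feats @ t.T
    return torch.cos(proj).mean(dim=0), torch.sin(proj).mean(dim=0)


def cf_discrepancy(real_feats, syn_feats, t):
    """Squared distance between the two empirical characteristic functions."""
    rr, ri = empirical_cf(real_feats, t)
    sr, si = empirical_cf(syn_feats, t)
    return ((rr - sr) ** 2 + (ri - si) ** 2).mean()


dim, n_syn = 128, 32
syn = nn.Parameter(torch.randn(n_syn, dim))   # distilled synthetic coreset
freq = FreqNet(dim)
opt_syn = torch.optim.Adam([syn], lr=1e-2)
opt_frq = torch.optim.Adam(freq.parameters(), lr=1e-2)

for step in range(200):
    real = torch.randn(256, dim)              # stand-in for real VLA features

    # Synthetic data minimizes the characteristic-function gap.
    loss = cf_discrepancy(real, syn, freq())
    opt_syn.zero_grad(); loss.backward(); opt_syn.step()

    # The frequency sampler maximizes it, forcing the match to hold at the
    # hardest frequencies (the adversarial min-max of the distillation).
    loss_adv = -cf_discrepancy(real, syn, freq())
    opt_frq.zero_grad(); loss_adv.backward(); opt_frq.step()
```

In a full pipeline, the per-sample values produced by the FT engine would weight the real-feature statistics so that the synthetic coreset concentrates on high-value samples.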