Vision-language-action (VLA) policies typically inherit their vision encoder from upstream VLM releases, but it is unclear whether an encoder choice validated on a small VLA transfers to a larger backbone. We introduce a frozen-backbone grafting diagnostic: the vision tower of a released VLA is replaced by a candidate encoder under a fixed protocol (adaptive average pooling, LayerNorm, and a single trainable linear projector), while the language model and action expert stay frozen. We evaluate four encoders on two LIBERO suites and two backbones (SmolVLA-450M and $\pi_{0.5}$-3.3B) with two to three seeds per cell: 40 main grafting runs plus native, LoRA, pooling, and zero-/shuffled-image controls, all scored by offline action MSE. The small-backbone winner does not reliably select the large-backbone top tier. SigLIP is best on SmolVLA across both suites, whereas on $\pi_{0.5}$ DINOv2-small leads the spatial suite and the object suite is a seed-sensitive near-tie band. Three of the four backbone-suite comparisons (and 11 of 12 seed-level cells) support backbone-dependent rankings. The grafting wrapper is itself non-neutral, with opposite sign across backbones (+45% to +56% MSE on the SmolVLA native tower, -50% to -52% on $\pi_{0.5}$), so all conclusions are conditional on the fixed grafting protocol. We position frozen grafting as a cheap target-backbone diagnostic to run before committing to an encoder at scale, not as a closed-loop deployment claim.
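The two quantities the abstract relies on, the signed relative MSE change introduced by the grafting wrapper and the question of whether the small-backbone winner lands in the large-backbone top tier, can be written down as a minimal sketch. The function names, the `top_k` definition of "top tier", and the use of mean per-seed MSE for ranking are assumptions made for illustration, not the paper's evaluation code; the numbers in the usage lines are placeholders, not reported results.

```python
from statistics import mean


def relative_mse_change(wrapped_mse, native_mse):
    """Signed percent change in offline action MSE caused by the wrapper.

    Positive means the wrapped tower is worse than the unwrapped native
    tower; negative means it is better. (Assumed convention.)
    """
    return 100.0 * (wrapped_mse - native_mse) / native_mse


def rank_encoders(mse_by_encoder):
    """Order encoder names from lowest to highest mean per-seed MSE."""
    return sorted(mse_by_encoder, key=lambda name: mean(mse_by_encoder[name]))


def winner_transfers(small_backbone, large_backbone, top_k=1):
    """True if the small-backbone winner is among the large backbone's top_k.

    Both arguments map encoder name -> list of per-seed MSE values for one
    suite. top_k is a hypothetical stand-in for "top tier".
    """
    small_winner = rank_encoders(small_backbone)[0]
    return small_winner in rank_encoders(large_backbone)[:top_k]


# Placeholder values for two hypothetical encoders (not results from the paper).
small = {"encoder_a": [0.10, 0.12], "encoder_b": [0.20, 0.21]}
large = {"encoder_a": [0.30, 0.32], "encoder_b": [0.25, 0.26]}
print(rank_encoders(small))           # ranking on the small backbone
print(winner_transfers(small, large)) # does the small-backbone winner transfer?
```

Ranking by mean MSE ignores seed variance, so a near-tie band such as the one described for the object suite would need a per-seed comparison or a tolerance on top of this sketch.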