Scaling Vision-Language-Action (VLA) models by upgrading the vision encoder is expected to improve downstream manipulation performance, as it does in vision-language modeling. We show that this expectation fails when actions are represented as discrete tokens, and we explain why through an information-theoretic principle we call the Compression Gap: in any visuomotor pipeline, scaling behavior is governed by the location of the tightest information bottleneck. When actions are continuous (e.g., Diffusion Policy), the vision encoder is the binding constraint, and upgrading it directly improves performance. When actions are discretized through a fixed-capacity codebook (e.g., OAT), the codebook becomes the binding constraint, and encoder improvements cannot propagate past it, regardless of how rich the upstream representation is. We validate this principle on the LIBERO benchmark with three lines of evidence: (1) a factorial experiment showing that encoder upgrades improve Diffusion Policy by over 21 percentage points, while OAT gains are substantially attenuated across model scales; (2) an encoder quality gradient across four encoders confirming that Diffusion Policy tracks encoder quality monotonically while OAT remains flat; and (3) a codebook size experiment demonstrating that relaxing codebook capacity partially recovers encoder sensitivity, providing causal evidence for the bottleneck hypothesis. Our findings indicate that scaling in Physical AI requires identifying where information bottlenecks lie in the pipeline, rather than uniformly increasing model or data size.