Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic control, and test-time scaling (TTS) has gained attention as a way to enhance their robustness beyond training. However, existing TTS methods for VLAs require additional training, external verifiers, and multiple forward passes, making them impractical for deployment. Moreover, they intervene only at action decoding while keeping visual representations fixed, which is insufficient under perceptual ambiguity, where reconsidering how to perceive is as important as deciding what to do. To address these limitations, we propose SCALE, a simple inference strategy that jointly modulates visual perception and action based on "self-uncertainty", inspired by uncertainty-driven exploration in Active Inference theory. SCALE requires no additional training, no verifier, and only a single forward pass. It broadens exploration in both perception and action under high uncertainty and focuses on exploitation when confident, enabling adaptive execution across varying conditions. Experiments on simulated and real-world benchmarks demonstrate that SCALE improves state-of-the-art VLAs and outperforms existing TTS methods while maintaining single-pass efficiency.
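The core mechanism described above, using the model's own output uncertainty to decide between exploration and exploitation in a single forward pass, can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's actual SCALE implementation: it measures self-uncertainty as the normalized entropy of a logit distribution and maps it to a sampling temperature (all function names and the temperature range are illustrative assumptions).

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Numerically stable softmax with temperature scaling.
    z = logits / temperature
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def self_uncertainty(logits):
    # Normalized entropy of the model's own output distribution:
    # 0.0 = fully confident, 1.0 = maximally uncertain (uniform).
    p = softmax(logits)
    entropy = -np.sum(p * np.log(p + 1e-12))
    return entropy / np.log(len(logits))

def uncertainty_scaled_temperature(logits, t_min=0.2, t_max=1.5):
    # Map self-uncertainty to a sampling temperature: broaden
    # exploration when uncertain, sharpen exploitation when confident.
    # The [t_min, t_max] range is an illustrative assumption.
    u = self_uncertainty(logits)
    return t_min + u * (t_max - t_min)
```

In this sketch, a confident (peaked) logit distribution yields a temperature near `t_min`, so sampling concentrates on the top action, while an ambiguous (flat) distribution pushes the temperature toward `t_max`, widening the sampled action distribution; no extra forward passes or verifiers are involved.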