Vision-Language-Action (VLA) models have emerged as powerful generalist policies for robotic manipulation, yet they remain fundamentally limited by their reliance on behavior cloning, leading to brittleness under distribution shift. While augmenting pretrained models with test-time search algorithms like Monte Carlo Tree Search (MCTS) can mitigate these failures, existing formulations rely solely on the VLA prior for guidance, lacking a grounded estimate of expected future return. Consequently, when the prior is inaccurate, the planner can only correct action selection via the exploration term, which requires extensive simulation to become effective. To address this limitation, we introduce Value Vision-Language-Action Planning and Search (V-VLAPS), a framework that augments MCTS with a lightweight, learnable value function. By training a simple multilayer perceptron (MLP) on the latent representations of a fixed VLA backbone (Octo), we provide the search with an explicit success signal that biases action selection toward high-value regions. We evaluate V-VLAPS on the LIBERO robotic manipulation suite, demonstrating that our value-guided search improves success rates by over 5 percentage points while reducing the average number of MCTS simulations by 5-15 percent compared to baselines that rely only on the VLA prior.
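To make the value-guided evaluation concrete, below is a minimal sketch of the idea in PyTorch. It is an illustration under assumptions, not the paper's implementation: the encoder call `backbone.encode(...)`, the hidden width, and the sigmoid output head are hypothetical choices; the abstract specifies only that a simple MLP is trained on the latent representations of a frozen VLA backbone (Octo) to provide a success signal to the search.

```python
import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Lightweight MLP mapping frozen VLA latents to a success estimate.

    The VLA backbone (e.g. Octo) is kept fixed; only this head is trained,
    e.g. with a binary cross-entropy loss on success/failure rollouts.
    """

    def __init__(self, latent_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        # Sigmoid so the output reads as an estimated probability of task success.
        return torch.sigmoid(self.net(latent)).squeeze(-1)


def evaluate_leaf(backbone, value_head, observation, instruction):
    """Value estimate for an MCTS leaf node, replacing prior-only evaluation.

    `backbone.encode` is a hypothetical API for extracting the VLA's latent
    representation of the current observation and language instruction.
    """
    with torch.no_grad():  # the VLA backbone stays frozen
        latent = backbone.encode(observation, instruction)
    return value_head(latent)  # backed up the tree as this node's value
```

During search, this leaf value would be backed up along the visited path, so a PUCT-style selection rule can bias expansion toward high-value branches even when the VLA prior is misleading.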