Vision--Language--Action (VLA) models bridge multimodal reasoning and physical control, but adapting them to new tasks from scarce demonstrations remains unreliable. Fine-tuned VLA policies often produce semantically plausible trajectories, yet failures frequently stem from unresolved geometric ambiguities: under limited supervision, near-miss action candidates can lead to divergent execution outcomes. We study few-shot VLA adaptation from a \emph{generation--selection} perspective and propose \textbf{VGAS} (\textbf{V}alue-\textbf{G}uided \textbf{A}ction-chunk \textbf{S}election), a novel framework that performs inference-time best-of-$N$ selection to identify action chunks that are both semantically faithful and geometrically precise. Specifically, \textbf{VGAS} employs a fine-tuned VLA as a high-recall proposal generator and introduces the \textrm{Q-Chunk-Former}, a geometrically grounded Transformer critic that resolves fine-grained geometric ambiguities. In addition, we propose \textit{Explicit Geometric Regularization} (\texttt{EGR}), which explicitly shapes a discriminative value landscape, preserving action-ranking resolution among near-miss candidates while mitigating value instability under scarce supervision. Experiments and theoretical analysis demonstrate that \textbf{VGAS} consistently improves success rates and robustness under limited demonstrations and distribution shift. Our code is available at https://github.com/Jyugo-15/VGAS.
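The generation--selection loop described above can be sketched in a few lines: a policy proposes $N$ candidate action chunks, a learned critic scores each one, and the highest-valued chunk is executed. The sketch below is a minimal, hypothetical illustration of this best-of-$N$ pattern; the `propose` and `score` stand-ins (a Gaussian sampler around a target pose and a negative-distance critic) are not the paper's VLA generator or Q-Chunk-Former, only toy substitutes to make the selection logic concrete.

```python
import numpy as np

def best_of_n_select(propose, score, n=8, rng=None):
    """Sample n candidate action chunks from a proposal function and
    return the chunk the critic scores highest (best-of-N selection).

    `propose` and `score` are hypothetical interfaces standing in for
    the VLA proposal generator and the value critic, respectively."""
    rng = rng or np.random.default_rng(0)
    chunks = [propose(rng) for _ in range(n)]          # generation
    values = np.array([score(c) for c in chunks])      # evaluation
    return chunks[int(np.argmax(values))]              # selection

# Toy stand-ins: action chunks are 3-D vectors; the "critic" prefers
# chunks close to a target pose (a crude proxy for geometric precision).
target = np.array([0.5, -0.2, 0.3])
propose = lambda rng: target + 0.1 * rng.standard_normal(3)
score = lambda c: -np.linalg.norm(c - target)

best = best_of_n_select(propose, score, n=16)
```

With a fixed seed inside the selector, widening the candidate pool can only help: the best of 16 samples scores at least as high as the single sample drawn first, which is the monotonicity that motivates best-of-$N$ selection.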