Recent advances in Vision-Language-Action (VLA) models have established a two-component architecture, where a pre-trained Vision-Language Model (VLM) encodes visual observations and task descriptions, and an action decoder maps these representations to continuous actions. Diffusion models have been widely adopted as action decoders due to their ability to model complex, multimodal action distributions. However, they require multiple iterative denoising steps at inference time, or auxiliary techniques to accelerate sampling, which limits their practicality in real-world settings where high-frequency control is crucial. In this work, we present NinA (Normalizing Flows in Action), a fast and expressive alternative to diffusion-based decoders for VLAs. NinA replaces the diffusion action decoder with a Normalizing Flow (NF) that enables one-shot sampling through an invertible transformation, significantly reducing inference time. We integrate NinA into the FLOWER VLA architecture and fine-tune it on the LIBERO benchmark. Our experiments show that NinA matches the performance of its diffusion-based counterpart under the same training regime, while achieving substantially faster inference. These results suggest that NinA offers a promising path toward efficient, high-frequency VLA control without compromising performance.
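The abstract's key mechanism is that a normalizing flow samples an action in a single invertible pass, instead of iterating denoising steps. As a toy illustration only (not NinA's or FLOWER's actual architecture), the sketch below shows one affine coupling layer: the forward direction maps an action to a Gaussian latent with a tractable log-determinant for likelihood training, and the inverse direction is the one-shot sampling pass conditioned on context features. All names and the fixed linear conditioner are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class AffineCoupling:
    """Toy affine coupling layer: splits the action vector in half and
    applies an invertible affine transform to the second half, with scale
    and shift predicted from the first half plus a context vector."""
    def __init__(self, dim, ctx_dim):
        half = dim // 2
        # Fixed random linear maps stand in for the small conditioner
        # network a real flow would learn (illustrative assumption).
        self.W_s = 0.1 * rng.standard_normal((half + ctx_dim, half))
        self.W_t = 0.1 * rng.standard_normal((half + ctx_dim, half))

    def forward(self, a, ctx):
        """Action -> latent; returns the log-determinant used in training."""
        a1, a2 = np.split(a, 2)
        h = np.concatenate([a1, ctx])
        s, t = h @ self.W_s, h @ self.W_t
        z2 = a2 * np.exp(s) + t          # invertible elementwise affine map
        return np.concatenate([a1, z2]), s.sum()

    def inverse(self, z, ctx):
        """Latent -> action: the single-pass sampling direction."""
        z1, z2 = np.split(z, 2)
        h = np.concatenate([z1, ctx])
        s, t = h @ self.W_s, h @ self.W_t
        a2 = (z2 - t) * np.exp(-s)
        return np.concatenate([z1, a2])

dim, ctx_dim = 4, 8
layer = AffineCoupling(dim, ctx_dim)
ctx = rng.standard_normal(ctx_dim)       # stands in for VLM features

# One-shot sampling: draw a Gaussian latent and invert the flow once.
z = rng.standard_normal(dim)
action = layer.inverse(z, ctx)

# Invertibility check: encoding the sampled action recovers the latent.
z_rec, logdet = layer.forward(action, ctx)
print(np.allclose(z_rec, z))  # True
```

A full flow would stack several such layers with permutations between them, but the contrast with diffusion is already visible here: sampling costs one inverse pass rather than many denoising iterations.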