ALAM: Algebraically Consistent Latent Transitions for Vision-Language-Action Models

Vision-language-action (VLA) models remain constrained by the scarcity of action-labeled robot data, whereas action-free videos provide abundant evidence of how the physical world changes. Latent action models offer a promising way to extract such priors from videos, but latent codes trained only for reconstruction are not necessarily suitable for policy generation: they may predict future observations yet lack the structure needed to be reused, or generated coherently, alongside robot actions. We introduce ALAM (Algebraic Latent Action Model), a latent action model with algebraically consistent transitions that turns temporal relations in action-free video into structural supervision. Given frame triplets, ALAM learns latent transitions that are grounded by reconstruction and regularized by composition and reversal consistency, which encourages a locally additive transition space. For downstream VLA learning, we freeze the pretrained encoder and use its latent transition sequences as auxiliary generative targets, co-generated with robot actions under a joint flow-matching objective. This couples structured latent transitions with flow-based policy generation, allowing the policy to exploit ALAM's locally consistent transition geometry without requiring latent-to-action decoding. Representation probes show that ALAM reduces additivity and reversibility errors by a factor of 25 to 85 relative to unstructured latent-action baselines and improves long-horizon cumulative reconstruction. When transferred to VLA policies, ALAM raises the average success rate from 47.9% to 85.0% on MetaWorld MT50 and from 94.1% to 98.1% on LIBERO, with consistent gains on real-world manipulation tasks. Ablations further confirm that the strongest improvements arise from the synergy between algebraically structured latent transitions and joint flow matching.
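To make the composition and reversal constraints concrete, the following is a minimal sketch of how such consistency errors could be measured on a frame triplet. It is an illustration under stated assumptions, not the paper's implementation: the squared-norm form of the errors, the function names (`composition_error`, `reversal_error`) and the two toy encoders are all hypothetical, and frames are stood in for by short feature vectors.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical encoder signature: maps a (source, target) frame pair
# to a latent transition vector.
Encoder = Callable[[Sequence[float], Sequence[float]], Vector]


def add(u: Sequence[float], v: Sequence[float]) -> Vector:
    return [a + b for a, b in zip(u, v)]


def sub(u: Sequence[float], v: Sequence[float]) -> Vector:
    return [a - b for a, b in zip(u, v)]


def sq_norm(u: Sequence[float]) -> float:
    return sum(a * a for a in u)


def composition_error(encode: Encoder, f_a, f_b, f_c) -> float:
    """Assumed form: || z(a->b) + z(b->c) - z(a->c) ||^2."""
    z_ab = encode(f_a, f_b)
    z_bc = encode(f_b, f_c)
    z_ac = encode(f_a, f_c)
    return sq_norm(sub(add(z_ab, z_bc), z_ac))


def reversal_error(encode: Encoder, f_a, f_b) -> float:
    """Assumed form: || z(a->b) + z(b->a) ||^2."""
    return sq_norm(add(encode(f_a, f_b), encode(f_b, f_a)))


def additive_encoder(f_a, f_b) -> Vector:
    # Toy encoder: a plain feature difference, which is additive by construction.
    return [b - a for a, b in zip(f_a, f_b)]


def unstructured_encoder(f_a, f_b) -> Vector:
    # Toy encoder: a nonlinear map with no additive structure.
    return [b * b - a for a, b in zip(f_a, f_b)]


# A toy "frame triplet" given as 2-D feature vectors.
f0, f1, f2 = [0.0, 1.0], [1.0, 2.0], [3.0, 0.0]

print(composition_error(additive_encoder, f0, f1, f2))      # 0.0
print(reversal_error(additive_encoder, f0, f1))             # 0.0
print(composition_error(unstructured_encoder, f0, f1, f2))  # 4.0
print(reversal_error(unstructured_encoder, f0, f1))         # 4.0
```

In this sketch the additive encoder satisfies both constraints exactly, while the unstructured one does not. A regularizer of this kind would penalize the nonzero errors during training, alongside the reconstruction term that keeps the transitions grounded in the video.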
