Despite significant progress in Vision-Language-Action (VLA) models, existing approaches remain inefficient at extracting action-critical signals from redundant sensor streams in highly complex, dynamic environments that involve real-time, unpredictable interactions, such as 3D open worlds and large-scale PvP games. To tackle this, we introduce MAIN-VLA, a framework that explicitly Models the Abstraction of Intention and eNvironment to ground decision-making in deep semantic alignment rather than superficial pattern matching. Specifically, our Intention Abstraction (IA) module distills verbose linguistic instructions and their associated reasoning into compact, explicit semantic primitives, while the Environment Semantics Abstraction (ESA) module projects overwhelming visual streams into a structured, topological affordance representation. Furthermore, aligning these two abstract modalities induces an emergent attention-concentration effect, enabling a parameter-free token-pruning strategy that filters out perceptual redundancy without degrading performance. Extensive experiments in open-world Minecraft and large-scale PvP environments (Game for Peace and Valorant) demonstrate that MAIN-VLA sets a new state of the art, achieving superior decision quality, stronger generalization, and leading inference efficiency.
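The abstract does not specify how the parameter-free token pruning is implemented; the function name, the use of per-token attention mass, and the `keep_ratio` threshold below are all illustrative assumptions. A minimal sketch of attention-driven pruning, where visual tokens receiving the least attention from the intention tokens are dropped, might look like:

```python
import numpy as np

def prune_tokens(tokens, attn_scores, keep_ratio=0.5):
    """Hypothetical parameter-free pruning: keep only the visual tokens
    that receive the highest attention mass from intention tokens.

    tokens: (N, D) array of visual token embeddings
    attn_scores: (N,) attention mass each token receives
        (e.g., averaged over heads and intention tokens)
    keep_ratio: fraction of tokens to retain (an assumed hyperparameter)
    """
    n_keep = max(1, int(len(tokens) * keep_ratio))
    # Top-n_keep tokens by attention, with original ordering preserved
    keep_idx = np.sort(np.argsort(attn_scores)[-n_keep:])
    return tokens[keep_idx], keep_idx

# Toy example: 8 visual tokens with 4-dim embeddings
rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 4))
scores = np.array([0.30, 0.02, 0.25, 0.01, 0.20, 0.05, 0.15, 0.02])
pruned, idx = prune_tokens(tokens, scores, keep_ratio=0.5)
print(idx)  # -> [0 2 4 6], the four highest-attention tokens
```

Because the selection relies only on attention scores already computed by the model, the strategy introduces no learned parameters, consistent with the "parameter-free" claim in the abstract.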