Autonomous GUI agents interact with environments by perceiving interfaces and executing actions. As a virtual sandbox, a GUI world model empowers agents with human-like foresight by enabling action-conditioned prediction of future states. However, existing text- and pixel-based approaches struggle to achieve high visual fidelity and fine-grained structural controllability simultaneously. To this end, we propose Code2World, a vision-language coder that simulates the next visual state via renderable code generation. Specifically, to address data scarcity, we construct AndroidCode by translating GUI trajectories into high-fidelity HTML and refining the synthesized code through a visual-feedback revision mechanism, yielding a corpus of over 80K high-quality screen-action pairs. To adapt existing VLMs to code prediction, we first perform SFT as a cold start for format and layout following, and then apply Render-Aware Reinforcement Learning, which uses the rendered outcome as a reward signal to enforce visual-semantic fidelity and action consistency. Extensive experiments demonstrate that Code2World-8B achieves top performance on next-UI prediction, rivaling GPT-5 and Gemini-3-Pro-Image. Notably, Code2World can be flexibly integrated into downstream agents and significantly enhances navigation success rates, boosting Gemini-2.5-Flash by +9.5% on AndroidWorld. The code is available at https://github.com/AMAP-ML/Code2World.
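
To make the data-construction step concrete, here is a minimal sketch of a visual-feedback revision loop of the kind described above. The `coder` callable (a VLM endpoint mapping a prompt plus images to an HTML string), the `render` and `similarity` callables, the round budget, and the acceptance threshold are all hypothetical placeholders; the abstract does not specify the model or the acceptance criterion.

```python
# Hypothetical sketch of the visual-feedback revision loop for building
# AndroidCode: draft HTML from a real screenshot, render it, and ask the
# coder model to revise until the rendering is close enough to the target.
from typing import Callable, Optional

from PIL import Image

# Placeholder for any VLM endpoint: (prompt, images) -> HTML string.
Coder = Callable[[str, list], str]

def synthesize_screen_html(coder: Coder,
                           screenshot: Image.Image,
                           render: Callable[[str], Image.Image],
                           similarity: Callable[[Image.Image, Image.Image], float],
                           max_rounds: int = 3,
                           accept: float = 0.9) -> Optional[str]:
    """Translate one GUI screenshot into HTML, refining via visual feedback."""
    html = coder("Reproduce this screen as a single self-contained HTML page.",
                 [screenshot])
    for _ in range(max_rounds):
        rendered = render(html)
        if similarity(rendered, screenshot) >= accept:
            return html  # accepted into the corpus
        # Show the coder both images so it can correct layout/content drift.
        html = coder("Your rendering (second image) differs from the target "
                     "screen (first image). Revise the HTML to match.",
                     [screenshot, rendered])
    return None  # discard samples that never converge
```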
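As an illustration of how a render-conditioned reward could be wired up, the sketch below renders the predicted HTML with Playwright and scores it against the ground-truth next screenshot via CLIP image-embedding cosine similarity. This is a sketch under stated assumptions, not the paper's formulation: the CLIP score, the `w_vis`/`w_act` weighting, and the binary `action_consistent` flag are stand-ins, since the abstract only states that the rendered outcome serves as the reward signal enforcing visual-semantic fidelity and action consistency.

```python
# Hypothetical render-aware reward: render the predicted HTML, embed both
# images with CLIP, and combine visual fidelity with action consistency.
import io

import torch
from PIL import Image
from playwright.sync_api import sync_playwright
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def render_html(html: str, width: int = 412, height: int = 915) -> Image.Image:
    """Render an HTML string to a screenshot at a phone-like viewport."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": width, "height": height})
        page.set_content(html, wait_until="load")
        png = page.screenshot()
        browser.close()
    return Image.open(io.BytesIO(png)).convert("RGB")

def visual_fidelity(pred: Image.Image, gold: Image.Image) -> float:
    """Cosine similarity of CLIP image embeddings, mapped to [0, 1]."""
    inputs = processor(images=[pred, gold], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    sim = torch.nn.functional.cosine_similarity(feats[0:1], feats[1:2]).item()
    return (sim + 1.0) / 2.0

def render_aware_reward(pred_html: str,
                        gold_screen: Image.Image,
                        action_consistent: bool,
                        w_vis: float = 0.7,
                        w_act: float = 0.3) -> float:
    """Weight a visual-fidelity term against an action-consistency check.
    Un-renderable code earns zero reward, pushing the policy toward valid HTML."""
    try:
        pred_screen = render_html(pred_html)
    except Exception:
        return 0.0
    return w_vis * visual_fidelity(pred_screen, gold_screen) + w_act * float(action_consistent)
```

Zeroing the reward on render failure is one natural design choice here: it couples code validity and visual quality into a single scalar, so the policy cannot gain reward from plausible-looking but non-renderable output.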