Real-time generative game engines represent a paradigm shift in interactive simulation, promising to replace traditional graphics pipelines with neural world models. However, existing approaches are fundamentally constrained by the ``Memory Wall,'' restricting practical deployments to low resolutions (e.g., $64 \times 64$). This paper bridges the gap between generative models and high-resolution neural simulation by introducing a scalable \textit{Hardware-Algorithm Co-Design} framework. We identify that high-resolution generation suffers from a critical resource mismatch: the World Model is compute-bound while the Decoder is memory-bound. To address this, we propose a heterogeneous architecture that decouples these two components across a cluster of AI accelerators. Our system features three core innovations: (1) an asymmetric resource allocation strategy that optimizes throughput under sequence parallelism constraints; (2) a memory-centric operator fusion scheme that minimizes off-chip bandwidth usage; and (3) a manifold-aware latent extrapolation mechanism that exploits temporal redundancy to mask latency. We validate our approach on a cluster of programmable AI accelerators, enabling real-time generation at $720 \times 480$ resolution -- a $50\times$ increase in pixel throughput over prior baselines. Evaluated on both a continuous 3D racing benchmark and a discrete 2D platformer benchmark, our system sustains 26.4 FPS and 48.3 FPS, respectively, with an amortized effective latency of 2.7 ms. This work demonstrates that resolving the ``Memory Wall'' via architectural co-design is not merely an optimization, but a prerequisite for high-fidelity, responsive neural gameplay.