Auto-regressive (AR) models have recently made notable progress in image generation, achieving performance comparable to diffusion-based approaches. However, their computational intensity and sequential nature impede on-device deployment and cause disruptive latency. We address this with \textbf{CIAR}, a cloud-device collaboration framework that uses on-device self-verification to handle two key properties of visual synthesis: \textit{the vast token vocabulary} required for high-fidelity images, and \textit{inherent spatial redundancy}, which makes tokens in homogeneous regions highly predictable while tokens at object boundaries remain highly uncertain. Verifying every token uniformly wastes resources on the redundant ones. Our solution centers on an on-device token uncertainty quantifier that replaces conventional discrete solution sets with continuous probability intervals, which accelerates processing and makes it feasible for large visual vocabularies. Additionally, we incorporate an interval-enhanced decoding module that further speeds up decoding, and we maintain visual fidelity and semantic consistency through a distribution alignment training strategy. Extensive experiments demonstrate that, compared with existing methods, CIAR achieves a 2.18$\times$ speed-up and reduces cloud requests by 70\% while preserving image quality.
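The idea of verifying only uncertain tokens can be illustrated with a minimal sketch. It is not CIAR's method: the abstract does not specify the interval-based quantifier, so Shannon entropy of the on-device draft distribution is used here as an assumed stand-in, and the names `entropy`, `select_for_cloud`, and `threshold` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_cloud(draft_dists, threshold):
    """Return the positions whose draft distribution is too uncertain.

    Only these positions would be sent to the cloud model for verification;
    the remaining tokens are accepted on device. The entropy test is an
    assumed stand-in, not CIAR's interval-based quantifier.
    """
    return [i for i, dist in enumerate(draft_dists) if entropy(dist) > threshold]


# Toy 4-token vocabulary: positions 0 and 2 mimic a homogeneous region
# (one dominant token), position 1 mimics an object boundary (flat distribution).
dists = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.94, 0.02, 0.02, 0.02],
]

to_verify = select_for_cloud(dists, threshold=0.5)
cloud_fraction = len(to_verify) / len(dists)
```

In this toy case only the boundary-like position is selected, so one of three tokens would need a cloud request. The threshold of 0.5 nats is arbitrary and chosen only to separate the two kinds of distribution in the example.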