Autoregressive (AR) architectures have achieved significant success in LLMs, inspiring their exploration for video generation. In LLMs, top-p/top-k sampling strategies work exceptionally well: language tokens have high semantic density and low redundancy, so a fixed candidate-set size already strikes a balance between semantic accuracy and generation diversity. In contrast, video tokens have low semantic density and high spatio-temporal redundancy. This mismatch makes static top-k/top-p strategies ineffective for video decoders: they either introduce unnecessary randomness in low-uncertainty regions (static backgrounds) or lock in early errors in high-uncertainty regions (foreground objects). Prediction errors accumulate as more frames are generated and eventually severely degrade long-horizon quality. To address this, we propose Entropy-Guided k-Guard (ENkG) sampling, a simple yet effective strategy that adapts sampling to token-wise dispersion, quantified by the entropy of each token's predicted distribution. ENkG adapts the candidate-set size per token: in low-entropy regions it uses fewer candidates to suppress redundant noise and preserve structural integrity; in high-entropy regions it uses more candidates to mitigate error compounding. ENkG is model-agnostic, training-free, and adds negligible overhead. Experiments demonstrate consistent improvements in perceptual quality and structural stability over static top-k/top-p strategies.
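The core idea above can be sketched in a few lines: normalize the entropy of each token's predicted distribution and map it to a candidate count before top-k sampling. This is a minimal illustrative sketch, not the paper's implementation; the function name `enkg_sample`, the linear entropy-to-k mapping, and the `k_min`/`k_max` bounds are all assumptions introduced here for clarity.

```python
import numpy as np

def enkg_sample(logits, k_min=1, k_max=64, rng=None):
    """Entropy-guided adaptive top-k sampling (illustrative sketch).

    `k_min`/`k_max` and the linear mapping from normalized entropy to
    candidate count are hypothetical choices, not the paper's exact rule.
    """
    rng = rng if rng is not None else np.random.default_rng()
    # Softmax over the vocabulary (numerically stabilized).
    z = logits - logits.max()
    probs = np.exp(z) / np.exp(z).sum()
    # Shannon entropy, normalized to [0, 1] by its maximum log|V|.
    ent = -(probs * np.log(probs + 1e-12)).sum()
    h = ent / np.log(len(probs))
    # Low entropy -> few candidates (suppress noise);
    # high entropy -> more candidates (avoid locking in early errors).
    k = min(len(probs), int(round(k_min + h * (k_max - k_min))))
    # Keep the k most probable tokens, renormalize, and sample.
    top = np.argsort(probs)[::-1][:k]
    p = probs[top] / probs[top].sum()
    token = rng.choice(top, p=p)
    return int(token), k
```

A sharply peaked distribution thus collapses to near-greedy decoding (k close to `k_min`), while a flat distribution samples from up to `k_max` candidates, matching the static-background vs. foreground-object behavior the abstract describes.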