Text-to-video diffusion models have enabled open-ended video synthesis, but they often fail to generate the number of objects specified in a prompt. We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration. On CountBench, a benchmark we introduce, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on the 5B and 14B models, respectively. It also improves CLIP alignment while maintaining temporal consistency. These results demonstrate that structural guidance complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion. The code is available at https://github.com/H-EmbodVis/NUMINA.
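To make the "identify" step concrete, here is a minimal sketch of what checking a countable layout against a prompt could look like. It is not taken from the NUMINA code base. It assumes the layout has already been reduced to a single 2D map of attention scores, and it counts 4-connected regions above a threshold. The names `count_regions`, `count_mismatch`, `toy_map`, and the threshold value are all hypothetical and chosen for illustration only.

```python
from collections import deque


def count_regions(attn, threshold=0.5):
    """Count 4-connected regions whose score exceeds `threshold`.

    `attn` is a list of equal-length rows of floats (a toy stand-in for
    an attention-derived layout; an assumption, not NUMINA's data format).
    """
    rows, cols = len(attn), len(attn[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = 0
    for r in range(rows):
        for c in range(cols):
            if attn[r][c] <= threshold or seen[r][c]:
                continue
            # Found an unvisited foreground cell: flood-fill its region.
            regions += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (
                        0 <= ny < rows
                        and 0 <= nx < cols
                        and not seen[ny][nx]
                        and attn[ny][nx] > threshold
                    ):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return regions


def count_mismatch(attn, target_count, threshold=0.5):
    """Return detected minus requested count; 0 means the layout agrees."""
    return count_regions(attn, threshold) - target_count


# Toy layout with three separate high-score blobs.
toy_map = [
    [0.9, 0.8, 0.0, 0.0, 0.7],
    [0.9, 0.1, 0.0, 0.0, 0.6],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.8, 0.9, 0.0, 0.0],
]

if __name__ == "__main__":
    print(count_regions(toy_map))      # number of blobs in the toy layout
    print(count_mismatch(toy_map, 4))  # negative: fewer blobs than requested
```

In this toy setting, a nonzero mismatch would be the signal that a prompt asking for four objects is heading toward a layout with a different number, which is the kind of inconsistency the guidance step is meant to correct.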