Image-to-Video (I2V) generation aims to synthesize a video from a reference image and a text prompt, which requires diffusion models to reconcile high-frequency visual constraints with low-frequency textual guidance during denoising. However, existing I2V models prioritize visual consistency, and how to effectively couple this dual guidance to ensure strong adherence to the text prompt remains underexplored. In this work, we observe that in Diffusion Transformer (DiT)-based I2V models, certain intermediate layers exhibit weak semantic responses (termed Semantic-Weak Layers), indicated by a measurable drop in text-visual similarity. We attribute this to a phenomenon we call Condition Isolation, in which attention over visual features becomes partially detached from the text guidance and relies excessively on learned visual priors. To address this, we propose Focal Guidance (FG), which enhances semantic controllability at Semantic-Weak Layers through two mechanisms: (1) Fine-grained Semantic Guidance (FSG) leverages CLIP to identify key regions in the reference frame and uses them as anchors to guide Semantic-Weak Layers; (2) Attention Cache transfers attention maps from semantically responsive layers to Semantic-Weak Layers, injecting explicit semantic signals, alleviating their over-reliance on the model's learned visual priors, and thereby improving adherence to textual instructions. To validate our approach and address the lack of evaluation in this direction, we further introduce a benchmark for assessing instruction following in I2V models. On this benchmark, Focal Guidance demonstrates its effectiveness and generalizability, raising the total score of Wan2.1-I2V to 0.7250 (+3.97\%) and of the MMDiT-based HunyuanVideo-I2V to 0.5571 (+7.44\%).
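The two mechanisms above can be made concrete with a short sketch. The snippet below is a minimal, hypothetical illustration rather than the paper's implementation: it probes per-layer text-visual similarity to flag Semantic-Weak Layers, and blends a cached attention map from a semantically responsive layer into a weak one. The tensor shapes, the pooled-text anchor, the `drop_ratio` threshold, and the blending weight `alpha` are all illustrative assumptions.

```python
# Minimal, hypothetical sketch (PyTorch); not the authors' implementation.
import torch
import torch.nn.functional as F


def layer_text_visual_similarity(visual_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
    """Mean cosine similarity between each layer's visual tokens and a pooled text embedding.

    visual_feats: (num_layers, num_tokens, dim) hidden states collected per DiT block.
    text_feats:   (num_text_tokens, dim) text-encoder outputs (same dim assumed for simplicity).
    """
    text_anchor = F.normalize(text_feats.mean(dim=0), dim=-1)   # (dim,)
    vis = F.normalize(visual_feats, dim=-1)                     # (L, N, dim)
    return (vis @ text_anchor).mean(dim=-1)                     # (L,) per-layer similarity


def find_semantic_weak_layers(sims: torch.Tensor, drop_ratio: float = 0.85) -> list[int]:
    """Flag layers whose similarity drops below drop_ratio * mean similarity of preceding layers."""
    return [i for i in range(1, sims.numel()) if sims[i] < drop_ratio * sims[:i].mean()]


def inject_cached_attention(weak_attn: torch.Tensor, cached_attn: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Blend an attention map cached from a responsive layer into a Semantic-Weak Layer.

    weak_attn, cached_attn: (heads, queries, keys), rows assumed softmax-normalized.
    """
    mixed = (1.0 - alpha) * weak_attn + alpha * cached_attn
    return mixed / mixed.sum(dim=-1, keepdim=True)              # renormalize attention rows


if __name__ == "__main__":
    torch.manual_seed(0)
    hidden = torch.randn(30, 256, 1024)      # e.g. 30 DiT blocks, 256 visual tokens, dim 1024
    text = torch.randn(77, 1024)             # e.g. 77 text tokens
    sims = layer_text_visual_similarity(hidden, text)
    print("semantic-weak layers:", find_semantic_weak_layers(sims))
```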