In recent years, training-free video generation has progressed remarkably. However, when handling complex textual instructions, existing methods still suffer from semantic ambiguity, incorrect concept binding, and cross-frame inconsistency. To address these issues, we propose KGEdit, a structured semantic control framework for text-to-video (T2V) diffusion models. Specifically, we first construct an ambiguity-aware knowledge graph (AAKG) to disentangle and disambiguate the input prompt, converting it into four types of structured semantics: identity, relation, attribute, and negative constraints. We then design a structured semantic injection module (SSIM) to inject these semantic signals into key layers of the diffusion Transformer, enabling fine-grained semantic control. In addition, we introduce a temporal-aware semantic control (TASC) module that dynamically schedules semantic objectives according to the stage-wise characteristics of the denoising process, further improving semantic alignment and temporal consistency. Experiments show that KGEdit outperforms existing methods in editing precision and temporal stability, while offering higher efficiency and controllability in text-driven interaction scenarios.
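To make the four semantic types and the stage-wise scheduling idea concrete, the following is a minimal illustrative sketch and not the KGEdit implementation. The container `StructuredSemantics`, the function `stage_weights`, the hand-filled example prompt, and in particular the linear early-to-late weighting are all assumptions introduced here for illustration; the abstract does not specify how AAKG parses a prompt or how TASC schedules its objectives.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class StructuredSemantics:
    """Hypothetical container for the four semantic types named in the abstract."""
    identity: List[str] = field(default_factory=list)                # entities in the prompt
    relation: List[Tuple[str, str, str]] = field(default_factory=list)  # (subject, predicate, object)
    attribute: Dict[str, List[str]] = field(default_factory=dict)    # entity -> attributes
    negative: List[str] = field(default_factory=list)                # concepts to suppress


def stage_weights(step: int, total_steps: int) -> Dict[str, float]:
    """Assumed linear schedule over denoising steps (step 0 = noisiest).

    Early steps emphasize identity and relation (coarse layout), late steps
    emphasize attributes (fine detail); negative constraints stay constant.
    This split is an illustrative guess, not the schedule used by TASC.
    """
    progress = step / max(total_steps - 1, 1)  # 0.0 at the first step, 1.0 at the last
    return {
        "identity": 1.0 - progress,
        "relation": 1.0 - progress,
        "attribute": progress,
        "negative": 1.0,
    }


# Hand-written decomposition of the prompt
# "a red ball to the left of a blue cube, no shadows".
# No automatic parsing is performed here.
semantics = StructuredSemantics(
    identity=["ball", "cube"],
    relation=[("ball", "left of", "cube")],
    attribute={"ball": ["red"], "cube": ["blue"]},
    negative=["shadows"],
)

first = stage_weights(0, 50)
last = stage_weights(49, 50)
```

Under these assumptions, binding each attribute to a named entity (rather than leaving "red" and "blue" as free-floating tokens) is what would let an injection module target incorrect concept binding, and the per-step weights indicate which semantic signal would dominate at each denoising stage.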