Diffusion-based video editing has made significant progress; however, achieving precise and temporally consistent object-level control, especially in multi-object scenarios, remains challenging due to attention leakage, identity drift, and unstable temporal dynamics. In this work, we propose IDAGEdit, a training-free framework for fine-grained multi-object video editing with strong temporal consistency. The framework adopts Layout-guided Attention Modulation to facilitate coherent multi-object editing, while Instance-level Masks are introduced to preserve individual object identity and enforce localized attention within each object region, thereby enabling fine-grained, object-level editing. Extensive qualitative and quantitative evaluations demonstrate that our method improves temporal stability and multi-object controllability over state-of-the-art video editing approaches.
翻译:暂无翻译