VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation - 专知论文

会员服务 ·

0

控制器 · 变换 · 查准率/准确率 · 有向 · 分解的 ·

VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation

翻译：暂无翻译

Sixiao Zheng,Zimian Peng,Yanpeng Zhou,Yi Zhu,Hang Xu,Xiangru Huang,Yanwei Fu

from arxiv, Accepted to TVCG 2026

Controllable image-to-video (I2V) generation transforms a reference image into a coherent video guided by user-specified control signals. While precise control over camera motion, object motion, and lighting is essential for high-fidelity creation, existing methods often treat these factors independently. This overlooks the physical coupling among viewpoint, geometry, and illumination in dynamic scenes, leading to visual inconsistencies such as mismatched shadows and perspective drift under simultaneous changes. We present VidCRAFT3, a unified and flexible I2V framework that explicitly models cross-factor interactions among geometry, motion, and illumination, enabling both independent and joint control over camera motion, object motion, and lighting direction. Image2Cloud provides explicit 3D geometric priors for accurate camera motion control. ObjMotionNet encodes sparse object trajectories into multi-scale motion features to guide realistic object motion. A Spatial Triple-Attention Transformer integrates lighting direction through lighting cross-attention for consistent relighting. To address the scarcity of jointly annotated data, we construct the VideoLightingDirection (VLD) dataset with accurate per-frame lighting direction annotations, and introduce a three-stage progressive training strategy that enables robust learning without fully joint annotations. Extensive experiments demonstrate that VidCRAFT3 achieves state-of-the-art performance in control precision and visual coherence across diverse scenarios.

翻译：暂无翻译

0

相关内容

控制器

RSS 2024 | NaVid：视觉语言导航大模型

RSS 2024 | NaVid：视觉语言导航大模型

专知会员服务

34+阅读 · 2024年6月9日

【ICCV2023最佳论文】ControlNet: 为文本到图像扩散模型添加条件控制

【ICCV2023最佳论文】ControlNet: 为文本到图像扩散模型添加条件控制

专知会员服务

27+阅读 · 2023年10月4日

ICML2023 | 轻量级视觉Transformer(ViT)的预训练实践手册

ICML2023 | 轻量级视觉Transformer(ViT)的预训练实践手册

专知会员服务

42+阅读 · 2023年5月10日

【CVPR 2022】基于Transformer的图象风格化，StyTr2: Image Style Transfer with Transformers

【CVPR 2022】基于Transformer的图象风格化，StyTr2: Image Style Transfer with Transformers

专知会员服务

11+阅读 · 2022年3月19日

【CVPR 2022】可控图像合成与编辑的合成生成先验学习，SemanticStyleGAN: Learning Compositonal Generative Priors for Controllable Image Synthesis and Editing

【CVPR 2022】可控图像合成与编辑的合成生成先验学习，SemanticStyleGAN: Learning Compositonal Generative Priors for Controllable Image Synthesis and Editing

专知会员服务

23+阅读 · 2022年3月3日

【AAAI2022】基于交互式transformer和暹罗网络的视频目标分割

【AAAI2022】基于交互式transformer和暹罗网络的视频目标分割

专知会员服务

24+阅读 · 2022年2月6日

Swin Transformer重磅升级！Swin V2：向更大容量、更高分辨率的更大模型迈进

Swin Transformer重磅升级！Swin V2：向更大容量、更高分辨率的更大模型迈进

专知会员服务

28+阅读 · 2021年11月20日

[ICCV 2021] 从二到一：一种带有视觉语言建模网络的新场景文本识别器

专知会员服务

18+阅读 · 2021年10月17日

【论文推荐】小样本视频合成，Few-shot Video-to-Video Synthesis

【论文推荐】小样本视频合成，Few-shot Video-to-Video Synthesis

专知会员服务

24+阅读 · 2019年12月15日

Understanding Color and the In-Camera Image Processing Pipeline for Computer Vision 【Michael S. Brown IEEE】韩国 ICCV 2019

Understanding Color and the In-Camera Image Processing Pipeline for Computer Vision 【Michael S. Brown IEEE】韩国 ICCV 2019

专知会员服务

10+阅读 · 2019年10月30日

CVPR 2020 | 看图说话之随心所欲：细粒度可控的图像描述自动生成

CVPR 2020 | 看图说话之随心所欲：细粒度可控的图像描述自动生成

AI科技评论

14+阅读 · 2020年3月16日

使用双目相机进行三维重建第二部分：姿态估计

使用双目相机进行三维重建第二部分：姿态估计

AI研习社

12+阅读 · 2019年5月7日

新型相机DVS/Event-based camera的发展及应用

新型相机DVS/Event-based camera的发展及应用

计算机视觉life

16+阅读 · 2019年3月12日

基于 TensorFlow 、OpenCV 和 Docker 的实时视频目标检测

基于 TensorFlow 、OpenCV 和 Docker 的实时视频目标检测

AI研习社

10+阅读 · 2018年7月23日

【论文推荐】最新八篇视频描述生成相关论文—在线视频理解、联合定位和描述事件、生成视频、跨模态注意力机制、联合事件检测和描述

【论文推荐】最新八篇视频描述生成相关论文—在线视频理解、联合定位和描述事件、生成视频、跨模态注意力机制、联合事件检测和描述

专知

11+阅读 · 2018年6月4日

STRCF for Visual Object Tracking

STRCF for Visual Object Tracking

统计学习与视觉计算组

15+阅读 · 2018年5月29日

【论文推荐】最新六篇图像描述生成相关论文—字符级推断、视觉解释、语义对齐、实体感知、确定性非自回归

【论文推荐】最新六篇图像描述生成相关论文—字符级推断、视觉解释、语义对齐、实体感知、确定性非自回归

专知

15+阅读 · 2018年5月28日

【论文推荐】最新六篇图像分割相关论文—控制、全卷积网络、子空间表示、多模态图像分割

【论文推荐】最新六篇图像分割相关论文—控制、全卷积网络、子空间表示、多模态图像分割

专知

25+阅读 · 2018年4月15日

【论文推荐】最新六篇图像描述生成相关论文—视频摘要、注意力张量积、非自回归神经序列模型、副词识别、多主体、多样性度量

【论文推荐】最新六篇图像描述生成相关论文—视频摘要、注意力张量积、非自回归神经序列模型、副词识别、多主体、多样性度量

专知

10+阅读 · 2018年3月2日

【论文推荐】最新5篇图像分割（Image Segmentation）相关论文—多重假设、超像素分割、自监督、图、生成对抗网络

【论文推荐】最新5篇图像分割（Image Segmentation）相关论文—多重假设、超像素分割、自监督、图、生成对抗网络

专知

27+阅读 · 2018年2月7日

大规模多视角高维图像特征提取

国家自然科学基金

5+阅读 · 2017年12月31日

多目主动相机智能监控关键技术研究

国家自然科学基金

2+阅读 · 2015年12月31日

欠覆盖环境下城市多源监控视频大数据高效编码方法研究

国家自然科学基金

0+阅读 · 2015年12月31日

基于多源视频的大范围场景目标跟踪

国家自然科学基金

2+阅读 · 2015年12月31日

保持结构的交互式图像及视频编辑方法研究

国家自然科学基金

2+阅读 · 2015年12月31日

云环境下结合视觉特征的图像视频集编码与传输

国家自然科学基金

1+阅读 · 2015年12月31日

基于人类3D视觉感应的2D到3D视频转换关键技术研究

国家自然科学基金

2+阅读 · 2015年12月31日

自由视点三维视频中纹理-深度图像联合建模及应用

国家自然科学基金

0+阅读 · 2015年12月31日

多纹理多深度的3D视频码率控制研究

国家自然科学基金

0+阅读 · 2015年12月31日

智能视频监控中图像超分辨率重建关键技术研究

国家自然科学基金

4+阅读 · 2014年12月31日

LoT-Pass: Long-term-robust Image Watermarking for Image to Video Generation

Arxiv

0+阅读 · 6月23日

OrthoMotion:Disentangling Camera and Subject Motion via Geometry Semantics Orthogonal Attention

Arxiv

0+阅读 · 6月22日

Customizing Video Portraits via Identity-ActionDecoupling

Arxiv

0+阅读 · 6月21日

UnityShots: Memory-Driven Multi-Shot Audio-Video Generation with Boundary-Aware Gating

Arxiv

0+阅读 · 6月19日

TriMotion: Modality-Agnostic Camera Control for Video Generation

Arxiv

0+阅读 · 6月18日

Holo-World: Unified Camera, Object and Weather Control for Video World Model

Arxiv

0+阅读 · 6月18日

Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models

Arxiv

0+阅读 · 6月18日

ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?

Arxiv

0+阅读 · 6月17日

ART-VS: Adaptive Resolution Tiling for Vision Transformer Visual Servoing

Arxiv

0+阅读 · 6月17日

Image-to-Image Translation: Methods and Applications

Arxiv

17+阅读 · 2021年1月21日

VIP会员

文章信息

相关主题

查准率/准确率

最新内容

从采集到决策：美军视角下的战术情报范式重构

从采集到决策：美军视角下的战术情报范式重构

专知会员服务

10+阅读 · 8月2日

乌克兰“德尔塔”系统揭示无人机、数据与领导力如何重塑现代安全格局

乌克兰“德尔塔”系统揭示无人机、数据与领导力如何重塑现代安全格局

专知会员服务

5+阅读 · 8月2日

大规模作战中的参谋流程：作为联合兵种作战组成部分的目标锁定

大规模作战中的参谋流程：作为联合兵种作战组成部分的目标锁定

专知会员服务

8+阅读 · 8月2日

《北约概念开发与实验（CD&E）手册：概念开发者工具箱》100页手册

《北约概念开发与实验（CD&E）手册：概念开发者工具箱》100页手册

专知会员服务

10+阅读 · 8月2日

《履带式无人地面战车技术发展现状》

《履带式无人地面战车技术发展现状》

专知会员服务

5+阅读 · 8月2日

《美国空军B-2“幽灵”隐身轰炸机系统工程案例研究》117页

《美国空军B-2“幽灵”隐身轰炸机系统工程案例研究》117页

专知会员服务

8+阅读 · 8月1日

隐身技术前沿综述：物理机理、工程实践与战略展望

隐身技术前沿综述：物理机理、工程实践与战略展望

专知会员服务

6+阅读 · 8月1日

《多变海洋环境下无人水面艇与自主水下机器人对接的最优路径规划》

《多变海洋环境下无人水面艇与自主水下机器人对接的最优路径规划》

专知会员服务

6+阅读 · 8月1日

《以机反机：基于无人机载麦克风的空中周界入侵检测》

《以机反机：基于无人机载麦克风的空中周界入侵检测》

专知会员服务

6+阅读 · 8月1日

《无人机脆弱性利用：网络空间力量的新域》

《无人机脆弱性利用：网络空间力量的新域》

专知会员服务

4+阅读 · 8月1日

美空军如何将人工智能从战场部署至后方机关

美空军如何将人工智能从战场部署至后方机关

专知会员服务

13+阅读 · 7月31日

《美战争部指令文件：网络空间效应与使能能力测试评估》

《美战争部指令文件：网络空间效应与使能能力测试评估》

专知会员服务

9+阅读 · 7月31日

《史诗怒火行动：多域前瞻评估》49页报告

《史诗怒火行动：多域前瞻评估》49页报告

专知会员服务

12+阅读 · 7月31日

《英国防部：未来空战系统数字化战略》33页

《英国防部：未来空战系统数字化战略》33页

专知会员服务

7+阅读 · 7月31日

《面向自主飞行网络的智能体人工智能架构》

《面向自主飞行网络的智能体人工智能架构》

专知会员服务

10+阅读 · 7月31日

相关VIP内容

RSS 2024 | NaVid：视觉语言导航大模型

RSS 2024 | NaVid：视觉语言导航大模型

专知会员服务

34+阅读 · 2024年6月9日

【ICCV2023最佳论文】ControlNet: 为文本到图像扩散模型添加条件控制

【ICCV2023最佳论文】ControlNet: 为文本到图像扩散模型添加条件控制

专知会员服务

27+阅读 · 2023年10月4日

ICML2023 | 轻量级视觉Transformer(ViT)的预训练实践手册

ICML2023 | 轻量级视觉Transformer(ViT)的预训练实践手册

专知会员服务

42+阅读 · 2023年5月10日

【CVPR 2022】基于Transformer的图象风格化，StyTr2: Image Style Transfer with Transformers

【CVPR 2022】基于Transformer的图象风格化，StyTr2: Image Style Transfer with Transformers

专知会员服务

11+阅读 · 2022年3月19日

【CVPR 2022】可控图像合成与编辑的合成生成先验学习，SemanticStyleGAN: Learning Compositonal Generative Priors for Controllable Image Synthesis and Editing

【CVPR 2022】可控图像合成与编辑的合成生成先验学习，SemanticStyleGAN: Learning Compositonal Generative Priors for Controllable Image Synthesis and Editing

专知会员服务

23+阅读 · 2022年3月3日

【AAAI2022】基于交互式transformer和暹罗网络的视频目标分割

【AAAI2022】基于交互式transformer和暹罗网络的视频目标分割

专知会员服务

24+阅读 · 2022年2月6日

Swin Transformer重磅升级！Swin V2：向更大容量、更高分辨率的更大模型迈进

Swin Transformer重磅升级！Swin V2：向更大容量、更高分辨率的更大模型迈进

专知会员服务

28+阅读 · 2021年11月20日

[ICCV 2021] 从二到一：一种带有视觉语言建模网络的新场景文本识别器

专知会员服务

18+阅读 · 2021年10月17日

【论文推荐】小样本视频合成，Few-shot Video-to-Video Synthesis

【论文推荐】小样本视频合成，Few-shot Video-to-Video Synthesis

专知会员服务

24+阅读 · 2019年12月15日

Understanding Color and the In-Camera Image Processing Pipeline for Computer Vision 【Michael S. Brown IEEE】韩国 ICCV 2019

Understanding Color and the In-Camera Image Processing Pipeline for Computer Vision 【Michael S. Brown IEEE】韩国 ICCV 2019

专知会员服务

10+阅读 · 2019年10月30日

热门VIP内容

开通专知VIP会员享更多权益服务

乌克兰“德尔塔”系统揭示无人机、数据与领导力如何重塑现代安全格局

《北约概念开发与实验（CD&E）手册：概念开发者工具箱》100页手册

从采集到决策：美军视角下的战术情报范式重构

大规模作战中的参谋流程：作为联合兵种作战组成部分的目标锁定

相关资讯

CVPR 2020 | 看图说话之随心所欲：细粒度可控的图像描述自动生成

CVPR 2020 | 看图说话之随心所欲：细粒度可控的图像描述自动生成

AI科技评论

14+阅读 · 2020年3月16日

使用双目相机进行三维重建第二部分：姿态估计

使用双目相机进行三维重建第二部分：姿态估计

AI研习社

12+阅读 · 2019年5月7日

新型相机DVS/Event-based camera的发展及应用

新型相机DVS/Event-based camera的发展及应用

计算机视觉life

16+阅读 · 2019年3月12日

基于 TensorFlow 、OpenCV 和 Docker 的实时视频目标检测

基于 TensorFlow 、OpenCV 和 Docker 的实时视频目标检测

AI研习社

10+阅读 · 2018年7月23日

【论文推荐】最新八篇视频描述生成相关论文—在线视频理解、联合定位和描述事件、生成视频、跨模态注意力机制、联合事件检测和描述

【论文推荐】最新八篇视频描述生成相关论文—在线视频理解、联合定位和描述事件、生成视频、跨模态注意力机制、联合事件检测和描述

专知

11+阅读 · 2018年6月4日

STRCF for Visual Object Tracking

STRCF for Visual Object Tracking

统计学习与视觉计算组

15+阅读 · 2018年5月29日

【论文推荐】最新六篇图像描述生成相关论文—字符级推断、视觉解释、语义对齐、实体感知、确定性非自回归

【论文推荐】最新六篇图像描述生成相关论文—字符级推断、视觉解释、语义对齐、实体感知、确定性非自回归

专知

15+阅读 · 2018年5月28日

【论文推荐】最新六篇图像分割相关论文—控制、全卷积网络、子空间表示、多模态图像分割

【论文推荐】最新六篇图像分割相关论文—控制、全卷积网络、子空间表示、多模态图像分割

专知

25+阅读 · 2018年4月15日

【论文推荐】最新六篇图像描述生成相关论文—视频摘要、注意力张量积、非自回归神经序列模型、副词识别、多主体、多样性度量

【论文推荐】最新六篇图像描述生成相关论文—视频摘要、注意力张量积、非自回归神经序列模型、副词识别、多主体、多样性度量

专知

10+阅读 · 2018年3月2日

【论文推荐】最新5篇图像分割（Image Segmentation）相关论文—多重假设、超像素分割、自监督、图、生成对抗网络

【论文推荐】最新5篇图像分割（Image Segmentation）相关论文—多重假设、超像素分割、自监督、图、生成对抗网络

专知

27+阅读 · 2018年2月7日

相关论文

LoT-Pass: Long-term-robust Image Watermarking for Image to Video Generation

Arxiv

0+阅读 · 6月23日

OrthoMotion:Disentangling Camera and Subject Motion via Geometry Semantics Orthogonal Attention

Arxiv

0+阅读 · 6月22日

Customizing Video Portraits via Identity-ActionDecoupling

Arxiv

0+阅读 · 6月21日

UnityShots: Memory-Driven Multi-Shot Audio-Video Generation with Boundary-Aware Gating

Arxiv

0+阅读 · 6月19日

TriMotion: Modality-Agnostic Camera Control for Video Generation

Arxiv

0+阅读 · 6月18日

Holo-World: Unified Camera, Object and Weather Control for Video World Model

Arxiv

0+阅读 · 6月18日

Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models

Arxiv

0+阅读 · 6月18日

ImageWAM: Do World Action Models Really Need Video Generation, or Just Image Editing?

Arxiv

0+阅读 · 6月17日

ART-VS: Adaptive Resolution Tiling for Vision Transformer Visual Servoing

Arxiv

0+阅读 · 6月17日

Image-to-Image Translation: Methods and Applications

Arxiv

17+阅读 · 2021年1月21日

相关基金

大规模多视角高维图像特征提取

国家自然科学基金

5+阅读 · 2017年12月31日

多目主动相机智能监控关键技术研究

国家自然科学基金

2+阅读 · 2015年12月31日

欠覆盖环境下城市多源监控视频大数据高效编码方法研究

国家自然科学基金

0+阅读 · 2015年12月31日

基于多源视频的大范围场景目标跟踪

国家自然科学基金

2+阅读 · 2015年12月31日

保持结构的交互式图像及视频编辑方法研究

国家自然科学基金

2+阅读 · 2015年12月31日

云环境下结合视觉特征的图像视频集编码与传输

国家自然科学基金

1+阅读 · 2015年12月31日

基于人类3D视觉感应的2D到3D视频转换关键技术研究

国家自然科学基金

2+阅读 · 2015年12月31日

自由视点三维视频中纹理-深度图像联合建模及应用

国家自然科学基金

0+阅读 · 2015年12月31日

多纹理多深度的3D视频码率控制研究

国家自然科学基金

0+阅读 · 2015年12月31日

智能视频监控中图像超分辨率重建关键技术研究

国家自然科学基金

4+阅读 · 2014年12月31日

微信扫码咨询专知VIP会员