With the rise of short-video platforms such as TikTok, users increasingly express their creativity through photos and videos. However, ordinary users lack the skills needed to produce high-quality videos with professional creation software. To meet the demand for intelligent, user-friendly video creation tools, we propose the Dynamic Visual Composition (DVC) task, an interesting and challenging task that aims to automatically integrate various media elements according to user requirements and create storytelling videos. To address it, we propose the Intelligent Director framework, which uses LENS to generate descriptions of images and video frames and ChatGPT to turn these descriptions into coherent captions and to recommend appropriate music names. The best-matched music is then obtained through music retrieval, and the captions, images, videos, and music are integrated to seamlessly synthesize the video. Finally, we apply AnimeGANv2 for style transfer. We construct the UCF101-DVC and Personal Album datasets and verify the effectiveness of our framework on the DVC task through qualitative and quantitative comparisons and user studies, demonstrating its substantial potential.
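The planning stages described in the abstract (describe the media, write captions, recommend a music name, retrieve the closest track) can be illustrated with a minimal sketch. Everything named here is hypothetical: `Storyboard`, `plan_video`, and `retrieve_music` are illustrative names, the `describe`, `write_captions`, and `recommend_music` callables stand in for LENS and ChatGPT, and fuzzy title matching is only an assumed stand-in for music retrieval, whose actual method the abstract does not specify. Video synthesis and AnimeGANv2 style transfer are left out.

```python
from dataclasses import dataclass
from difflib import get_close_matches
from typing import Callable, List, Optional


@dataclass
class Storyboard:
    captions: List[str]          # one caption per media item, in input order
    music_track: Optional[str]   # title chosen from the local library, if any


def retrieve_music(recommended: str, library: List[str]) -> Optional[str]:
    """Return the library title closest to the recommended music name.

    Assumption: case-insensitive fuzzy string matching stands in for the
    retrieval step; the abstract does not describe the real mechanism.
    """
    by_lowercase = {title.lower(): title for title in library}
    hits = get_close_matches(
        recommended.lower(), list(by_lowercase), n=1, cutoff=0.0
    )
    return by_lowercase[hits[0]] if hits else None


def plan_video(
    media: List[str],
    describe: Callable[[str], str],                 # stand-in for LENS
    write_captions: Callable[[List[str]], List[str]],  # stand-in for ChatGPT
    recommend_music: Callable[[List[str]], str],    # stand-in for ChatGPT
    library: List[str],
) -> Storyboard:
    """Run the planning stages in the order the abstract lists them."""
    descriptions = [describe(item) for item in media]
    captions = write_captions(descriptions)
    music_name = recommend_music(captions)
    return Storyboard(captions, retrieve_music(music_name, library))
```

Passing the stages in as callables keeps the sketch independent of any particular model API; a real system would replace the stand-ins with calls to the captioning and language models and would hand the resulting `Storyboard` to a video synthesis step.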