This article introduces Lester, a novel method to automatically synthesize retro-style 2D animations from videos. The method approaches the challenge mainly as an object segmentation and tracking problem. Video frames are processed with the Segment Anything Model (SAM), and the resulting masks are tracked through subsequent frames with DeAOT, a hierarchical propagation method for semi-supervised video object segmentation. The geometry of the masks' contours is simplified with the Douglas-Peucker algorithm. Finally, facial traits, pixelation and a basic shadow effect can optionally be added. The results show that the method exhibits excellent temporal consistency and can correctly process videos with different poses and appearances, dynamic shots, partial shots and diverse backgrounds. The proposed method provides a simpler and more deterministic approach than video-to-video translation pipelines based on diffusion models, which suffer from temporal consistency problems and do not cope well with pixelated and schematic outputs. The method is also much more practical than techniques based on 3D human pose estimation, which require custom handcrafted 3D models and are very limited with respect to the type of scenes they can process.
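The contour simplification step mentioned above can be illustrated with a minimal sketch of the Douglas-Peucker algorithm in plain Python. This is not Lester's implementation: the function names `perpendicular_distance` and `douglas_peucker`, the tolerance `epsilon` and the sample points are assumptions made for illustration, and the sketch treats the contour as an open polyline, whereas a mask contour is closed. Libraries such as OpenCV expose a comparable routine (`cv2.approxPolyDP`).

```python
import math


def perpendicular_distance(point, start, end):
    """Distance from `point` to the line through `start` and `end`.

    Falls back to the distance to `start` when the two endpoints coincide.
    """
    (px, py), (sx, sy), (ex, ey) = point, start, end
    dx, dy = ex - sx, ey - sy
    length = math.hypot(dx, dy)
    if length == 0.0:
        return math.hypot(px - sx, py - sy)
    # Magnitude of the 2D cross product divided by the segment length.
    return abs(dy * (px - sx) - dx * (py - sy)) / length


def douglas_peucker(points, epsilon):
    """Simplify an open polyline, keeping points farther than `epsilon`
    from the chord joining the endpoints of the current span."""
    if len(points) < 3:
        return list(points)

    start, end = points[0], points[-1]
    max_dist, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], start, end)
        if d > max_dist:
            max_dist, index = d, i

    # Every interior point is close enough to the chord: keep only the ends.
    if max_dist <= epsilon:
        return [start, end]

    # Otherwise keep the farthest point and simplify both halves recursively.
    left = douglas_peucker(points[: index + 1], epsilon)
    right = douglas_peucker(points[index:], epsilon)
    return left[:-1] + right  # drop the duplicated split point


# Hypothetical contour fragment: a nearly flat run followed by a sharp peak.
contour = [(0, 0), (1, 0.05), (2, 0), (3, 3), (4, 0)]
simplified = douglas_peucker(contour, epsilon=0.5)
print(simplified)
```

A larger `epsilon` discards more vertices and yields a more schematic, polygonal outline, while a smaller value stays closer to the original mask boundary.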