Recent efforts in video matting have focused on eliminating trimap dependency, since trimap annotations are expensive and trimap-based methods are poorly suited to real-time applications. Although the latest trimap-free methods show promising results, their performance often degrades on highly diverse and unstructured videos. We address this limitation by introducing Adaptive Matting for Dynamic Videos, termed AdaM, a framework designed to simultaneously differentiate foregrounds from backgrounds and capture alpha matte details of human subjects in the foreground. Two interconnected network designs are employed to achieve this goal: (1) an encoder-decoder network that produces alpha mattes and intermediate masks, the latter of which guide the transformer in adaptively decoding foregrounds and backgrounds, and (2) a transformer network in which long- and short-term attention combine to retain spatial and temporal contexts, facilitating the decoding of foreground details. We benchmark and study our method on recently introduced datasets, showing that our model notably improves matting realism and temporal coherence in complex real-world videos and achieves new best-in-class generalizability. Further details and examples are available at https://github.com/microsoft/AdaM.
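To make the idea of combining long- and short-term attention concrete, the following is a minimal sketch and not the AdaM implementation. It assumes that each memory stores per-location keys together with two-element foreground/background indicator values derived from an intermediate mask, and that the two attention read-outs are fused by simple summation. The function names, the memory layout, and the fusion rule are all hypothetical choices made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention of one query vector over one memory."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def long_short_term_attention(query, long_memory, short_memory):
    """Sum the read-outs of a long-term and a short-term memory.

    Assumption: summation is used as the fusion rule purely for illustration.
    """
    long_out = attend(query, long_memory["keys"], long_memory["values"])
    short_out = attend(query, short_memory["keys"], short_memory["values"])
    return [l + s for l, s in zip(long_out, short_out)]


# Hypothetical toy data: values are [foreground, background] indicators
# taken from an intermediate mask (an assumption of this sketch).
query = [1.0, 0.0]
long_memory = {
    "keys": [[1.0, 0.0], [0.0, 1.0]],    # features kept from earlier frames
    "values": [[1.0, 0.0], [0.0, 1.0]],  # foreground, then background
}
short_memory = {
    "keys": [[1.0, 0.0]],                # features from the previous frame
    "values": [[1.0, 0.0]],              # foreground
}
fused = long_short_term_attention(query, long_memory, short_memory)
```

In this toy setup the query resembles the stored foreground key, so the fused read-out weights the foreground channel more heavily than the background channel. A real model would operate on learned, high-dimensional features across many spatial locations and frames.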