Analog magnetic tapes have been the primary video storage medium for several decades. Videos stored on analog videotapes exhibit unique degradation patterns, caused by tape aging and playback device malfunctions, that differ from those observed in film and digital video restoration tasks. In this work, we present a reference-based approach for the resToration of digitized Analog videotaPEs (TAPE). We leverage CLIP for zero-shot artifact detection, identifying the cleanest frames of each video through textual prompts that describe different artifacts. We then select the clean frames most similar to the input frames and employ them as references. We design a transformer-based Swin-UNet network that exploits both neighboring and reference frames via our Multi-Reference Spatial Feature Fusion (MRSFF) blocks. MRSFF blocks rely on cross-attention and attention pooling to take advantage of the most useful parts of each reference frame. To address the absence of ground truth in real-world videos, we create a synthetic dataset of videos exhibiting artifacts that closely resemble those commonly found in analog videotapes. Both quantitative and qualitative experiments show the effectiveness of our approach compared to other state-of-the-art methods. The code, the model, and the synthetic dataset are publicly available at https://github.com/miccunifi/TAPE.
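The reference selection step described above can be illustrated with a minimal sketch. It is not the released implementation: it assumes that frame and prompt embeddings have already been computed by a CLIP-like encoder, and the scoring rule (maximum cosine similarity to any artifact prompt), the fixed `threshold`, the exclusion of the input frame from its own references, and the function names `artifact_score` and `select_references` are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def artifact_score(frame_emb, artifact_prompt_embs):
    """Assumed scoring rule: highest similarity to any artifact prompt."""
    return max(cosine(frame_emb, p) for p in artifact_prompt_embs)


def select_references(frame_embs, artifact_prompt_embs, query_idx,
                      threshold=0.5, k=2):
    """Return indices of the k clean frames most similar to the query frame.

    A frame counts as clean when its artifact score is below `threshold`
    (a hypothetical fixed cutoff). The query frame itself is excluded.
    """
    clean = [
        i for i, emb in enumerate(frame_embs)
        if i != query_idx
        and artifact_score(emb, artifact_prompt_embs) < threshold
    ]
    query = frame_embs[query_idx]
    # Rank clean frames by similarity to the query, most similar first.
    clean.sort(key=lambda i: cosine(query, frame_embs[i]), reverse=True)
    return clean[:k]


# Toy 2-D embeddings standing in for real encoder outputs.
artifact_prompts = [[1.0, 0.0]]   # e.g. a prompt describing tape noise
frames = [
    [0.9, 0.1],   # frame 0: close to the artifact prompt (degraded)
    [0.1, 1.0],   # frame 1: clean
    [0.2, 1.0],   # frame 2: clean, slightly closer to frame 0
]
references = select_references(frames, artifact_prompts, query_idx=0, k=2)
```

In this toy setup, frame 0 scores high against the artifact prompt and is treated as degraded, while frames 1 and 2 fall below the cutoff and are ranked by similarity to frame 0. In a full pipeline the selected frames would be passed to the restoration network as references, where the MRSFF blocks decide which parts of each reference are useful.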