Current video text spotting methods achieve strong performance when given sufficient labeled training data. However, manual labeling is time-consuming and labor-intensive. Low-cost synthetic data is a promising alternative. This paper introduces FlowText, a novel video text synthesis technique that uses optical flow estimation to synthesize large amounts of text video data at low cost for training robust video text spotters. Unlike existing methods, which focus on image-level synthesis, FlowText synthesizes the temporal information of text instances across consecutive frames using optical flow. This temporal information, which covers text movement, distortion, appearance, disappearance, occlusion, and blur, is crucial for accurately tracking and spotting text in video sequences. Experiments show that combining general detectors such as TransDETR with the proposed FlowText produces remarkable results on various datasets, such as ICDAR2015video and ICDAR2013video. Code is available at https://github.com/callsys/FlowText.
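To make the idea of flow-driven temporal synthesis concrete, the following is a minimal sketch of how a text region could be carried from one frame to the next with a dense optical flow field. It is not the FlowText implementation: the function names (`propagate_polygon`, `is_visible`, `track_text`), the nearest-neighbour flow sampling, and the `min_fraction` visibility threshold are all assumptions made for illustration, and the flow fields are assumed to be precomputed by some optical flow estimator.

```python
import numpy as np


def propagate_polygon(polygon, flow):
    """Move polygon vertices from frame t to frame t+1 (hypothetical helper).

    polygon: (N, 2) array of (x, y) vertex coordinates in frame t.
    flow:    (H, W, 2) array where flow[y, x] = (dx, dy) from frame t to t+1.
    """
    polygon = np.asarray(polygon, dtype=np.float64)
    height, width = flow.shape[:2]
    # Sample the flow at the nearest pixel, clamping indices to the image.
    xs = np.clip(np.rint(polygon[:, 0]).astype(int), 0, width - 1)
    ys = np.clip(np.rint(polygon[:, 1]).astype(int), 0, height - 1)
    return polygon + flow[ys, xs]


def is_visible(polygon, height, width, min_fraction=0.5):
    """Treat the text as visible if enough vertices remain inside the frame.

    The 0.5 default is an illustrative assumption, not a value from the paper.
    """
    inside = (
        (polygon[:, 0] >= 0) & (polygon[:, 0] < width)
        & (polygon[:, 1] >= 0) & (polygon[:, 1] < height)
    )
    return inside.mean() >= min_fraction


def track_text(polygon, flows, min_fraction=0.5):
    """Propagate a text polygon through consecutive flow fields.

    Returns one polygon per frame, stopping when the text leaves the frame,
    which is a crude stand-in for text disappearance.
    """
    polygon = np.asarray(polygon, dtype=np.float64)
    track = [polygon]
    for flow in flows:
        polygon = propagate_polygon(polygon, flow)
        height, width = flow.shape[:2]
        if not is_visible(polygon, height, width, min_fraction):
            break
        track.append(polygon)
    return track


# Usage: a uniform flow of (3, 1) pixels per frame on a 20x40 image.
box = np.array([[5, 5], [15, 5], [15, 10], [5, 10]])
uniform_flow = np.zeros((20, 40, 2))
uniform_flow[..., 0] = 3.0
uniform_flow[..., 1] = 1.0
positions = track_text(box, [uniform_flow, uniform_flow])
```

A full synthesis pipeline would also need to render the text into each frame and to handle distortion, occlusion, and blur, none of which this sketch attempts.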