Video text spotting refers to localizing, recognizing, and tracking textual elements such as captions, logos, license plates, signs, and other forms of text across consecutive video frames. However, current datasets for this task rely on quadrilateral ground-truth annotations, which can include excessive background content and yield inaccurate text boundaries. Furthermore, methods trained on these datasets typically predict quadrilateral boxes, which limits their ability to handle complex scenarios such as dense or curved text. To address these issues, we propose SAMText, a scalable mask annotation pipeline for video text spotting. SAMText leverages the Segment Anything Model (SAM) to generate mask annotations for scene text images or video frames at scale. Using SAMText, we have created a large-scale dataset, SAMText-9M, that contains over 2,400 video clips sourced from existing datasets and over 9 million mask annotations. We have also conducted a thorough statistical analysis of the generated masks and their quality, and we identify several research topics that could be explored further based on this dataset. The code and dataset will be released at \url{https://github.com/ViTAE-Transformer/SAMText}.
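To make the concern about excessive background concrete, the following minimal sketch measures how much of an enclosing axis-aligned box is not covered by a quadrilateral text annotation. It is an illustration only, not the SAMText implementation: the helper names `quad_to_box`, `polygon_area`, and `background_fraction` are hypothetical, and the idea that such a box could serve as a prompt for SAM is an assumption that the text above does not state.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def quad_to_box(quad: List[Point]) -> Tuple[float, float, float, float]:
    """Return the axis-aligned box (x_min, y_min, x_max, y_max) enclosing a quadrilateral.

    Assumption: a box like this could be passed to a promptable segmenter
    such as SAM; the abstract does not specify the prompt type.
    """
    xs = [x for x, _ in quad]
    ys = [y for _, y in quad]
    return min(xs), min(ys), max(xs), max(ys)


def polygon_area(points: List[Point]) -> float:
    """Area of a simple polygon computed with the shoelace formula."""
    twice_area = 0.0
    for i, (x1, y1) in enumerate(points):
        x2, y2 = points[(i + 1) % len(points)]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def background_fraction(quad: List[Point]) -> float:
    """Fraction of the enclosing box that lies outside the quadrilateral."""
    x_min, y_min, x_max, y_max = quad_to_box(quad)
    box_area = (x_max - x_min) * (y_max - y_min)
    if box_area == 0:
        return 0.0  # degenerate annotation: nothing to measure
    return 1.0 - polygon_area(quad) / box_area


if __name__ == "__main__":
    # Hypothetical annotation of a slightly rotated text line.
    rotated_text = [(0.0, 2.0), (8.0, 0.0), (10.0, 4.0), (2.0, 6.0)]
    print(quad_to_box(rotated_text))
    print(background_fraction(rotated_text))
```

The same comparison could be made between a quadrilateral and a pixel-level mask: the tighter the annotation follows the glyphs, the smaller the share of background it contains, which is the motivation for moving from quadrilaterals to masks.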