LOGO: Video Text Spotting with Language Collaboration and Glyph Perception Model

Video text spotting aims to simultaneously localize, recognize, and track text instances in videos. To address the limited recognition capability of end-to-end methods, directly tracking the zero-shot results of state-of-the-art image text spotters can achieve impressive performance. However, owing to the domain gap between datasets, these methods usually obtain limited tracking trajectories on extreme datasets. Fine-tuning transformer-based text spotters on specific datasets can improve performance, but at the expense of considerable training resources. In this paper, we propose a Language Collaboration and Glyph Perception Model, termed LOGO, to enhance the performance of conventional text spotters through the integration of a synergy module. To achieve this goal, a language synergy classifier (LSC) is designed to explicitly discern text instances from background noise in the recognition stage. Specifically, the language synergy classifier outputs either text content or a background code based on the legibility of each text region, from which language scores are computed. Subsequently, fusion scores are computed as the average of the detection scores and language scores, and are used to re-score the detection results before tracking. Through this re-scoring mechanism, the proposed LSC facilitates the detection of low-resolution text instances while filtering out text-like regions. In addition, glyph supervision and a visual position mixture module are proposed to enhance the recognition accuracy of noisy text regions and to acquire more discriminative tracking features, respectively. Extensive experiments on public benchmarks validate the effectiveness of the proposed method.
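The re-scoring step described above can be illustrated with a minimal sketch. Only the averaging of detection and language scores comes from the abstract. The `Detection` container, the `rescore` helper, and the `keep_threshold` value of 0.5 are hypothetical names and choices introduced for illustration, and the language score is taken as a given input because the abstract does not specify how it is derived from the classifier output.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """Hypothetical container for one detected text region."""
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2)
    det_score: float   # confidence from the detection stage
    lang_score: float  # language score from the recognition stage (assumed given)


def fuse_scores(det_score: float, lang_score: float) -> float:
    """Fusion score as the average of detection and language scores."""
    return 0.5 * (det_score + lang_score)


def rescore(detections: List[Detection],
            keep_threshold: float = 0.5) -> List[Tuple[Detection, float]]:
    """Re-score detections and keep those at or above the threshold.

    The threshold is an assumed value for illustration, not one
    reported in the abstract.
    """
    kept = []
    for det in detections:
        fused = fuse_scores(det.det_score, det.lang_score)
        if fused >= keep_threshold:
            kept.append((det, fused))
    return kept


# A low-resolution but legible instance: weak detection, strong language score.
blurry_text = Detection(box=(10, 10, 60, 30), det_score=0.4, lang_score=0.9)
# A text-like background region: strong detection, weak language score.
text_like = Detection(box=(100, 40, 180, 70), det_score=0.7, lang_score=0.1)

results = rescore([blurry_text, text_like])
```

Under these assumed scores, the legible low-resolution instance is promoted above the threshold and passed on to tracking, while the text-like region is filtered out, which mirrors the behavior the abstract attributes to the re-scoring mechanism.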
