Extracting structured information from videos is critical for many downstream industrial applications. In this paper, we define the task of extracting hierarchical key information from visual text in videos. To address this task, we decouple it into four subtasks and introduce two solutions, PipVKIE and UniVKIE. PipVKIE completes the four subtasks sequentially in successive stages, while UniVKIE improves on this design by unifying all subtasks in a single backbone. Both PipVKIE and UniVKIE leverage multimodal information from vision, text, and coordinates for feature representation. Extensive experiments on one well-defined dataset demonstrate that our solutions achieve strong performance and fast inference.