Text image machine translation (TIMT) aims to translate text embedded in images from a source language into a target language. Existing methods follow either a two-stage cascade or a one-stage end-to-end architecture, and each has its own weakness. Cascade models can benefit from large-scale optical character recognition (OCR) and machine translation (MT) datasets, but the two-stage architecture is redundant. End-to-end models are efficient but suffer from a shortage of training data. To this end, we propose an end-to-end TIMT model that makes full use of the knowledge in existing OCR and MT datasets, yielding a framework that is both effective and efficient. More specifically, we build a novel modal adapter that bridges the OCR encoder and the MT decoder. An end-to-end TIMT loss and a cross-modal contrastive loss are used jointly to align the feature distributions of the OCR and MT tasks. Extensive experiments show that the proposed method outperforms existing two-stage cascade models and one-stage end-to-end models while using a lighter and faster architecture. Furthermore, ablation studies verify the generalization of our method: the proposed modal adapter is effective in bridging various OCR and MT models.
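The joint objective mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the cross-modal contrastive loss, so the code below assumes a standard InfoNCE-style loss over a batch, in which the i-th image-side feature (the adapter output) should match the i-th text-side feature. The function names, the `temperature` and `weight` parameters, and the pooled per-sample feature vectors are all hypothetical choices made for illustration, not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(image_feats, text_feats, temperature=0.1):
    """Assumed InfoNCE-style loss: pair i of image/text features is the positive,
    every other text feature in the batch is a negative."""
    img = [l2_normalize(v) for v in image_feats]
    txt = [l2_normalize(v) for v in text_feats]
    n = len(img)
    total = 0.0
    for i in range(n):
        # Cosine similarity of image feature i against every text feature.
        logits = [
            sum(a * b for a, b in zip(img[i], txt[j])) / temperature
            for j in range(n)
        ]
        total -= log_softmax(logits)[i]
    return total / n


def joint_loss(timt_loss, image_feats, text_feats, weight=1.0, temperature=0.1):
    """Hypothetical combination: translation loss plus a weighted contrastive term."""
    return timt_loss + weight * contrastive_loss(image_feats, text_feats, temperature)


if __name__ == "__main__":
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned pairs:", contrastive_loss(aligned, aligned))
    print("swapped pairs:", contrastive_loss(swapped, aligned))
```

Under these assumptions, features that are already aligned across modalities give a contrastive term near zero, while mismatched pairs are penalized heavily, which is the behavior one would want from a loss meant to pull the OCR-side and MT-side feature distributions together.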