This paper introduces AnyTrans, an all-encompassing framework for the Translate AnyText in the Image (TATI) task, which covers both multilingual text translation and the fusion of translated text back into images. Our framework leverages the strengths of large-scale models, namely Large Language Models (LLMs) and text-guided diffusion models, to incorporate contextual cues from both textual and visual elements during translation. The few-shot learning capability of LLMs allows fragmented texts to be translated in light of the overall context. Meanwhile, the inpainting and editing abilities of diffusion models make it possible to fuse the translated text seamlessly into the original image while preserving its style and realism. In addition, our framework can be built entirely from open-source models and requires no training, which makes it highly accessible and easy to extend. To encourage progress on the TATI task, we have compiled a test dataset called MTIT6, which consists of multilingual text image translation data from six language pairs.
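To make the context-aware translation step more concrete, the following is a minimal sketch of how the text fragments found in one image might be packed into a single few-shot prompt, so that an LLM sees them together rather than one by one. It is an illustration under stated assumptions, not the prompt format used by AnyTrans: the names `build_prompt` and `parse_translations`, the numbered-list layout, and the example pair are all hypothetical, and the LLM call itself and the diffusion-based fusion step are left out.

```python
from typing import List, Sequence, Tuple

# Hypothetical few-shot example: (source fragments, translated fragments).
Example = Tuple[Sequence[str], Sequence[str]]


def _numbered(lines: Sequence[str]) -> str:
    """Render fragments as a 1-based numbered list, one per line."""
    return "\n".join(f"{i}. {text}" for i, text in enumerate(lines, start=1))


def build_prompt(
    fragments: Sequence[str],
    src_lang: str,
    tgt_lang: str,
    examples: Sequence[Example] = (),
) -> str:
    """Assemble one prompt holding every fragment from a single image.

    Assumption: listing all fragments together lets the model use them
    as context for one another; the wording below is illustrative only.
    """
    parts = [
        f"Translate every numbered text fragment from {src_lang} to "
        f"{tgt_lang}. All fragments come from the same image, so use them "
        "as context for one another. Answer with the same numbering."
    ]
    # Few-shot demonstrations precede the actual query.
    for src, tgt in examples:
        parts.append("Fragments:\n" + _numbered(src))
        parts.append("Translations:\n" + _numbered(tgt))
    parts.append("Fragments:\n" + _numbered(fragments))
    parts.append("Translations:")
    return "\n\n".join(parts)


def parse_translations(response: str, expected: int) -> List[str]:
    """Map a numbered response back onto fragment positions.

    Fragments the model skipped stay as empty strings.
    """
    out = [""] * expected
    for line in response.splitlines():
        head, sep, tail = line.strip().partition(". ")
        if sep and head.isdecimal() and 1 <= int(head) <= expected:
            out[int(head) - 1] = tail.strip()
    return out
```

In such a setup, the string returned by `build_prompt` would be sent to whichever open-source LLM is in use, and `parse_translations` would realign its reply with the detected text regions before the translated text is rendered back into the image.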