Large numbers of annotated training images are critical for training accurate and robust deep network models, but collecting them is often time-consuming and costly. Image synthesis alleviates this constraint by generating annotated training images automatically, and it has attracted increasing interest in recent deep learning research. We develop an innovative image synthesis technique that composes annotated training images by realistically embedding foreground objects of interest (OOI) into background images. The proposed technique consists of two key components that, in principle, boost the usefulness of the synthesized images in deep network training. The first is context-aware semantic coherence, which ensures that the OOI are placed around semantically coherent regions within the background image. The second is harmonious appearance adaptation, which ensures that the embedded OOI agree with the surrounding background in both geometric alignment and appearance realism. The proposed technique has been evaluated on two related but very different computer vision challenges, namely scene text detection and scene text recognition. Experiments on a number of public datasets demonstrate the effectiveness of the proposed technique: deep networks trained on our synthesized images achieve scene text detection and scene text recognition performance similar to, or even better than, that obtained with real images.
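The two components described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a precomputed boolean `region_mask` marking semantically suitable background pixels (standing in for context-aware semantic coherence) and uses a simple mean-colour shift as a crude stand-in for appearance adaptation. Geometric alignment is not modelled. The function names `find_placement` and `embed_object` and the `strength` parameter are hypothetical.

```python
import numpy as np


def find_placement(region_mask, h, w):
    """Top-left (y, x) of the first h-by-w window fully inside region_mask, else None."""
    H, W = region_mask.shape
    if h < 1 or w < 1 or h > H or w > W:
        return None
    # Integral image: ii[i, j] = number of True cells in region_mask[:i, :j].
    ii = np.zeros((H + 1, W + 1), dtype=np.int64)
    ii[1:, 1:] = region_mask.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    # sums[y, x] = number of True cells in region_mask[y:y+h, x:x+w].
    sums = ii[h:, w:] - ii[:-h, w:] - ii[h:, :-w] + ii[:-h, :-w]
    ys, xs = np.nonzero(sums == h * w)
    if ys.size == 0:
        return None
    return int(ys[0]), int(xs[0])


def embed_object(background, fg_rgb, fg_alpha, region_mask, strength=0.25):
    """Paste a foreground object into the background; return (image, bbox) or None.

    bbox is (x, y, width, height) and serves as the automatic annotation.
    `strength` in [0, 1] is a hypothetical knob for the colour adaptation.
    """
    h, w = fg_alpha.shape
    pos = find_placement(region_mask, h, w)
    if pos is None:
        return None
    y, x = pos
    patch = background[y:y + h, x:x + w].astype(np.float64)
    fg = fg_rgb.astype(np.float64)
    # Crude appearance adaptation (assumption): shift the foreground colours
    # part of the way toward the mean colour of the local background patch.
    shift = patch.mean(axis=(0, 1)) - fg.mean(axis=(0, 1))
    adapted = fg + strength * shift
    # Alpha-blend the adapted foreground over the background patch.
    alpha = fg_alpha.astype(np.float64)[..., None]
    blended = alpha * adapted + (1.0 - alpha) * patch
    out = background.copy()
    out[y:y + h, x:x + w] = np.clip(np.rint(blended), 0, 255).astype(background.dtype)
    return out, (x, y, w, h)
```

Because the placement is chosen by the program, the bounding box comes for free with each synthesized image, which is what makes this kind of composition attractive as a source of annotated training data.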