TEXTRON: Weakly Supervised Multilingual Text Detection through Data Programming

Several recent deep learning (DL) based techniques perform considerably well on image-based multilingual text detection. However, their performance relies heavily on the availability and quality of training data. Page-level document images come in numerous types and carry information in several modalities, languages, fonts, and layouts. This makes text detection a challenging problem in computer vision (CV), especially for low-resource languages and handwritten text. Furthermore, word-level labeled data for text detection is scarce, especially for multilingual settings and for Indian scripts that combine printed and handwritten text. Conventionally, Indian script text detection requires training a DL model on a large amount of labeled data, but to the best of our knowledge, no relevant datasets are available. Manually annotating such data requires a lot of time, effort, and expertise. To solve this problem, we propose TEXTRON, a Data Programming-based approach in which users can plug various text detection methods into a weak supervision-based learning framework. One can view this approach to multilingual text detection as an ensemble of different CV-based techniques and DL approaches. TEXTRON can leverage the predictions of DL models pre-trained on a significant amount of language data, in conjunction with CV-based methods, to improve text detection in other languages. We demonstrate that TEXTRON can improve detection performance for documents written in Indian languages, despite the absence of corresponding labeled data. Further, through extensive experimentation, we show that our approach improves over the current state-of-the-art (SOTA) models, especially for handwritten Devanagari text. The code and dataset are available at https://github.com/IITB-LEAP-OCR/TEXTRON
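
To make the idea of plugging several detectors into a weak supervision framework concrete, the following is a minimal sketch and not the TEXTRON implementation. It assumes per-pixel labels, three hypothetical labeling functions (`lf_dark_pixels`, `lf_bright_background`, `lf_pretrained_stub`), made-up intensity thresholds, and a simple majority vote as the aggregator. Data programming frameworks typically learn how much to trust each labeling function instead of counting votes equally, so the vote here is only a stand-in for that step.

```python
from typing import Callable, List

ABSTAIN, NON_TEXT, TEXT = -1, 0, 1

Image = List[List[int]]      # grayscale intensities, 0 (black) to 255 (white)
LabelMap = List[List[int]]   # one vote per pixel
LabelingFunction = Callable[[Image], LabelMap]


def lf_dark_pixels(image: Image) -> LabelMap:
    """Hypothetical CV heuristic: very dark pixels are likely ink."""
    return [[TEXT if p < 80 else ABSTAIN for p in row] for row in image]


def lf_bright_background(image: Image) -> LabelMap:
    """Hypothetical CV heuristic: very bright pixels are likely background."""
    return [[NON_TEXT if p > 200 else ABSTAIN for p in row] for row in image]


def lf_pretrained_stub(image: Image) -> LabelMap:
    """Stand-in for a pre-trained detector; the predicted box is hard-coded."""
    row_min, col_min, row_max, col_max = 0, 0, 1, 1  # inclusive bounds (assumed)
    return [
        [
            TEXT if row_min <= r <= row_max and col_min <= c <= col_max else NON_TEXT
            for c in range(len(row))
        ]
        for r, row in enumerate(image)
    ]


def majority_vote(image: Image, lfs: List[LabelingFunction]) -> LabelMap:
    """Combine votes per pixel, ignoring abstentions; ties yield ABSTAIN."""
    votes = [lf(image) for lf in lfs]
    merged: LabelMap = []
    for r, row in enumerate(image):
        merged_row = []
        for c in range(len(row)):
            text = sum(v[r][c] == TEXT for v in votes)
            non_text = sum(v[r][c] == NON_TEXT for v in votes)
            if text > non_text:
                merged_row.append(TEXT)
            elif non_text > text:
                merged_row.append(NON_TEXT)
            else:
                merged_row.append(ABSTAIN)
        merged.append(merged_row)
    return merged


# A tiny 2x4 "document image" with made-up intensities.
image = [
    [30, 40, 120, 250],
    [20, 150, 60, 240],
]
merged = majority_vote(
    image, [lf_dark_pixels, lf_bright_background, lf_pretrained_stub]
)
print(merged)
```

In this toy input, the pixel at row 1, column 2 is dark (one vote for text) but lies outside the stub detector's box (one vote for non-text), so the vote ties and the pixel is left unlabeled. Disagreements of this kind are where a learned aggregation, rather than a plain vote, would be expected to matter.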
