When extracting structured data from repetitively organized documents, such as dictionaries, directories, or even newspapers, a key challenge is to correctly segment the basic text regions that make up the target database. Traditionally, this problem has been tackled as part of layout analysis, mostly with dividing (top-down) approaches based on visual clues. Some agglomerating (bottom-up) approaches have started to use textual information to link similar contents, but they require a proper over-segmentation into fine-grained units. In this work, we propose a new pragmatic approach whose efficiency is demonstrated on 19th-century French trade directories. We consider two sub-problems: coarse layout detection (text columns and reading order), which is assumed to be effective and is not detailed here, and a fine-grained entry separation stage, for which we adapt a state-of-the-art Named Entity Recognition (NER) approach. By injecting special visual tokens, coding, for instance, indentation or breaks, into the token stream of the language model used for NER, we can leverage both textual and visual knowledge simultaneously. Code, data, results and models are available at https://github.com/soduco/paper-entryseg-icdar23-code and https://huggingface.co/HueyNemud/ (icdar23-entrydetector* variants).
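The idea of injecting visual tokens into the token stream can be illustrated with a minimal sketch. It is not the authors' implementation: the marker names `[INDENT]` and `[BREAK]`, the `Line` structure, the `inject_visual_tokens` helper, the pixel tolerance, and the sample lines are all assumptions made for illustration. The sketch assumes that the coarse layout stage has already produced text lines in reading order, each with the x-coordinate of its left edge.

```python
from dataclasses import dataclass

# Hypothetical marker names; the tokens actually used in the paper may differ.
INDENT = "[INDENT]"
BREAK = "[BREAK]"


@dataclass
class Line:
    text: str    # OCR text of one physical line
    left: float  # x-coordinate of the line's left edge, in pixels


def inject_visual_tokens(lines, column_left, tolerance=5.0):
    """Flatten OCR lines into one token stream with layout markers.

    A BREAK marker is emitted between consecutive physical lines, and an
    INDENT marker is emitted when a line starts to the right of the column
    margin by more than `tolerance` pixels (an assumed threshold).
    """
    tokens = []
    for i, line in enumerate(lines):
        if i > 0:
            tokens.append(BREAK)
        if line.left - column_left > tolerance:
            tokens.append(INDENT)
        # Whitespace splitting stands in for the language model's tokenizer.
        tokens.extend(line.text.split())
    return tokens


# Invented sample lines, in reading order, for a column whose margin is at x=100.
lines = [
    Line("Dupont, boulanger,", left=100.0),
    Line("rue de la Paix, 12.", left=120.0),
    Line("Durand, tailleur, rue", left=100.0),
]
stream = inject_visual_tokens(lines, column_left=100.0)
print(" ".join(stream))
```

In this sketch the indented second line signals a continuation of the first entry, while the third line, flush with the margin, is a candidate entry start. With a Hugging Face tokenizer, such markers would typically be registered as additional special tokens, and the model's embedding matrix resized, so that they are not split into sub-word pieces.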