Extracting tables from documents is a crucial task in any document conversion pipeline. Recently, transformer-based models have demonstrated that table structure can be recognized with impressive accuracy using Image-to-Markup-Sequence (Im2Seq) approaches. Taking only the image of a table, such models predict a sequence of tokens (e.g. in HTML or LaTeX) that represents the structure of the table. Since the token representation of the table structure has a significant impact on the accuracy and run-time performance of any Im2Seq model, we investigate in this paper how table-structure representation can be optimised. We propose a new, optimised table-structure language (OTSL) with a minimised vocabulary and specific rules. OTSL reduces the vocabulary to 5 tokens (HTML needs 28+) and shortens the sequence length to half that of HTML on average. Consequently, model accuracy improves significantly, inference time is halved compared to HTML-based models, and the predicted table structures are always syntactically correct. This in turn eliminates most post-processing needs.
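To make the vocabulary and sequence-length argument concrete, the following is a minimal sketch that serialises one small table structure in two ways: as a grid of five structure tokens, and as a simplified HTML-style token stream. The token names (`C`, `L`, `U`, `X`, `NL`), the `(row, col, rowspan, colspan)` span format, and both helper functions are assumptions made for illustration. They are not taken from the text above, which states only the vocabulary size.

```python
from typing import List, Tuple

# Hypothetical five-token vocabulary (assumed names, for illustration only).
CELL, LEFT, UP, CROSS, NEWLINE = "C", "L", "U", "X", "NL"

# A merged cell, given as (row, col, rowspan, colspan) of its top-left corner.
Span = Tuple[int, int, int, int]


def to_grid_tokens(n_rows: int, n_cols: int, spans: List[Span]) -> List[str]:
    """Serialise a table as one token per grid position plus a row terminator."""
    grid = [[CELL] * n_cols for _ in range(n_rows)]
    for r, c, rs, cs in spans:
        for dr in range(rs):
            for dc in range(cs):
                if dr == 0 and dc == 0:
                    continue  # the span origin stays an ordinary cell
                if dr == 0:
                    grid[r][c + dc] = LEFT       # merged with the cell to the left
                elif dc == 0:
                    grid[r + dr][c] = UP         # merged with the cell above
                else:
                    grid[r + dr][c + dc] = CROSS  # interior of a 2D span
    tokens: List[str] = []
    for row in grid:
        tokens.extend(row)
        tokens.append(NEWLINE)
    return tokens


def to_html_tokens(n_rows: int, n_cols: int, spans: List[Span]) -> List[str]:
    """Serialise the same table as simplified HTML structure tokens (assumed scheme)."""
    origin = {(r, c): (rs, cs) for r, c, rs, cs in spans}
    covered = {
        (r + dr, c + dc)
        for r, c, rs, cs in spans
        for dr in range(rs)
        for dc in range(cs)
        if (dr, dc) != (0, 0)
    }
    tokens: List[str] = []
    for r in range(n_rows):
        tokens.append("<tr>")
        for c in range(n_cols):
            if (r, c) in covered:
                continue  # positions absorbed by a span emit nothing in HTML
            rs, cs = origin.get((r, c), (1, 1))
            if rs == 1 and cs == 1:
                tokens += ["<td>", "</td>"]
            else:
                tokens.append("<td")
                if rs > 1:
                    tokens.append(f' rowspan="{rs}"')
                if cs > 1:
                    tokens.append(f' colspan="{cs}"')
                tokens += [">", "</td>"]
        tokens.append("</tr>")
    return tokens


# A 3x3 table with a two-column header cell and a two-row cell in the last column.
example_spans = [(0, 0, 1, 2), (1, 2, 2, 1)]
grid_tokens = to_grid_tokens(3, 3, example_spans)
html_tokens = to_html_tokens(3, 3, example_spans)
print(len(grid_tokens), len(html_tokens))  # 12 24
```

For this one toy table the grid form needs 12 tokens and the HTML-style form 24. That single data point only illustrates the mechanism and says nothing about average behaviour. The grid form also keeps every row the same length, which gives a simple handle for checking that a predicted sequence is well formed.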