Table structure recognition aims to extract the logical and physical structure of unstructured table images into a machine-readable format. The latest end-to-end image-to-text approaches predict the two structures simultaneously with two decoders, where the prediction of the physical structure (the bounding boxes of the cells) is based on the representation of the logical structure. However, these methods produce imprecise bounding boxes because the logical representation lacks local visual information. To address this issue, we propose VAST, an end-to-end sequential modeling framework for table structure recognition. It contains a novel coordinate sequence decoder that is triggered by the representation of each non-empty cell from the logical structure decoder. In the coordinate sequence decoder, we model the bounding box coordinates as a language sequence, in which the left, top, right, and bottom coordinates are decoded sequentially to exploit the dependencies between coordinates. Furthermore, we propose an auxiliary visual-alignment loss that forces the logical representations of non-empty cells to contain more local visual details, which helps produce better cell bounding boxes. Extensive experiments demonstrate that our method achieves state-of-the-art results in both logical and physical structure recognition. The ablation study also confirms that the coordinate sequence decoder and the visual-alignment loss are key to the success of our method.
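To make the idea of treating bounding box coordinates as a language sequence concrete, the following is a minimal sketch and not the VAST implementation. It assumes that each coordinate is quantized into one of `n_bins` discrete tokens, a common choice for this kind of modeling that the abstract itself does not specify. The names `box_to_tokens`, `tokens_to_box`, and `N_BINS` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical vocabulary size for coordinate tokens; the abstract gives no value.
N_BINS = 1000

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def box_to_tokens(box: Box, width: float, height: float,
                  n_bins: int = N_BINS) -> List[int]:
    """Quantize a box into four tokens ordered left, top, right, bottom."""
    left, top, right, bottom = box

    def quantize(value: float, extent: float) -> int:
        # Clamp to the image, then map [0, extent] onto bin indices 0..n_bins-1.
        ratio = min(max(value / extent, 0.0), 1.0)
        return min(int(ratio * n_bins), n_bins - 1)

    return [
        quantize(left, width),
        quantize(top, height),
        quantize(right, width),
        quantize(bottom, height),
    ]


def tokens_to_box(tokens: List[int], width: float, height: float,
                  n_bins: int = N_BINS) -> Box:
    """Map four coordinate tokens back to pixel values (bin centres)."""
    left, top, right, bottom = tokens

    def centre(token: int, extent: float) -> float:
        return (token + 0.5) * extent / n_bins

    return (
        centre(left, width),
        centre(top, height),
        centre(right, width),
        centre(bottom, height),
    )
```

In a sequential decoder of the kind described above, these four tokens would be predicted one at a time, so that each coordinate can be conditioned on the ones already emitted for the same cell. The sketch covers only the tokenization step and does not model the decoder or the visual-alignment loss.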