Table structure recognition (TSR) aims to convert tabular images into a machine-readable format, where a visual encoder extracts image features and a textual decoder generates table-representing tokens. Existing approaches use classic convolutional neural network (CNN) backbones for the visual encoder and transformers for the textual decoder. However, this hybrid CNN-Transformer architecture introduces a complex visual encoder that accounts for nearly half of the total model parameters, markedly reduces both training and inference speed, and hinders the potential for self-supervised learning in TSR. In this work, we design a lightweight visual encoder for TSR without sacrificing expressive power. We discover that a convolutional stem can match classic CNN backbone performance, with a much simpler model. The convolutional stem strikes an optimal balance between two crucial factors for high-performance TSR: a higher receptive field (RF) ratio and a longer sequence length. This allows it to "see" an appropriate portion of the table and "store" the complex table structure within sufficient context length for the subsequent transformer. We conducted reproducible ablation studies and open-sourced our code at https://github.com/poloclub/tsr-convstem to enhance transparency, inspire innovations, and facilitate fair comparisons in our domain as tables are a promising modality for representation learning.
翻译:表格结构识别(TSR)旨在将表格图像转换为机器可读格式,其中视觉编码器提取图像特征,文本解码器生成表示表格的令牌。现有方法采用经典卷积神经网络(CNN)骨干作为视觉编码器,并采用Transformer作为文本解码器。然而,这种混合CNN-Transformer架构引入了复杂的视觉编码器,其参数约占模型总参数的一半,显著降低了训练和推理速度,并阻碍了自监督学习在TSR中的应用潜力。本文中,我们设计了一种轻量级视觉编码器用于TSR,且不牺牲表达能力。我们发现,卷积主干网络能以更简单的模型匹配经典CNN骨干的性能。卷积主干网络在高性能TSR的两个关键因素——更高感受野(RF)比率和更长序列长度——之间取得了最佳平衡。这使得它能够“看到”表格的适当部分,并在后续Transformer的足够上下文长度内“存储”复杂的表格结构。我们进行了可复现的消融研究,并将代码开源至 https://github.com/poloclub/tsr-convstem ,以增强透明度、激发创新,并促进该领域的公平比较,因为表格作为表示学习的一种有前景的模态。