Businesses need to query visually rich documents (VRDs) such as receipts, medical records, and insurance forms to make decisions. Existing techniques for extracting entities from VRDs struggle with new layouts or require extensive pre-training data. We introduce VRDSynth, a program synthesis method that automatically extracts entity relations from multilingual VRDs without pre-training data. To capture the complexity of the VRD domain, we design a domain-specific language (DSL) that expresses the spatial and textual relations used by the synthesized programs. We also derive a new synthesis algorithm that uses frequent spatial relations, search space pruning, and a combination of positive, negative, and exclusive programs to improve coverage. We evaluate VRDSynth on the FUNSD and XFUND benchmarks for semantic entity linking, which together comprise 1,592 forms in 8 languages. VRDSynth outperforms the state-of-the-art pre-trained models LayoutXLM, InfoXLMBase, and XLMRobertaBase in 5, 6, and 7 out of 8 languages, respectively, and improves the F1 score by 42% over LayoutXLM in English. To test the extensibility of the model, we further extend VRDSynth with automated table recognition, creating VRDSynth(Table), and compare it with larger versions of the pre-trained models, InfoXLM(Large) and XLMRoberta(Large). VRDSynth(Table) outperforms these baselines in 4 out of 8 languages and in average F1 score. VRDSynth also significantly reduces memory footprint (1M and 380MB vs. 1.48GB and 3GB for LayoutXLM) while maintaining similar time efficiency.
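To make the idea of combining positive and negative programs concrete, the following is a minimal illustrative sketch in Python, not the VRDSynth DSL or its synthesis algorithm. It assumes that a positive program proposes links between entities based on a spatial relation, that a negative program vetoes some of those links based on a textual condition, and that the final output is the proposed set minus the vetoed set. The `Entity` fields, the `right_of` relation, the `max_gap` threshold, the colon rule, and all function names are hypothetical choices made for this example; exclusive programs are omitted.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Entity:
    """A text box on a form (hypothetical representation)."""
    idx: int
    text: str
    label: str  # assumed labels: "question" or "answer"
    x0: float
    y0: float
    x1: float
    y1: float


Link = Tuple[int, int]
Program = Callable[[List[Entity]], Set[Link]]


def right_of(a: Entity, b: Entity, max_gap: float = 50.0) -> bool:
    """Spatial relation: b overlaps a vertically and starts just to its right."""
    same_row = min(a.y1, b.y1) - max(a.y0, b.y0) > 0
    gap = b.x0 - a.x1
    return same_row and 0 <= gap <= max_gap


def positive_program(entities: List[Entity]) -> Set[Link]:
    """Propose a link from each question to an answer directly to its right."""
    return {
        (a.idx, b.idx)
        for a in entities
        for b in entities
        if a.label == "question" and b.label == "answer" and right_of(a, b)
    }


def negative_program(entities: List[Entity]) -> Set[Link]:
    """Veto right-of links whose source text does not end with a colon
    (an arbitrary textual rule chosen only for illustration)."""
    return {
        (a.idx, b.idx)
        for a in entities
        for b in entities
        if right_of(a, b) and not a.text.rstrip().endswith(":")
    }


def extract_links(
    entities: List[Entity],
    positives: Iterable[Program],
    negatives: Iterable[Program],
) -> Set[Link]:
    """Union of all positive proposals minus the union of all negative vetoes."""
    proposed: Set[Link] = set().union(*(p(entities) for p in positives))
    vetoed: Set[Link] = set().union(*(n(entities) for n in negatives))
    return proposed - vetoed


entities = [
    Entity(0, "Name:", "question", 10, 10, 60, 20),
    Entity(1, "Alice", "answer", 70, 10, 120, 20),
    Entity(2, "Total", "question", 10, 40, 60, 50),
    Entity(3, "42", "answer", 70, 40, 100, 50),
    Entity(4, "Date:", "question", 10, 70, 60, 80),
]

links = extract_links(entities, [positive_program], [negative_program])
print(links)
```

In this toy form, the positive program proposes two links and the negative program removes the one whose source lacks a trailing colon. Representing programs as plain functions over an entity list keeps them composable, which is the property the sketch is meant to illustrate; the actual system synthesizes such rules automatically rather than relying on hand-written ones.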