This work introduces VERSE, a methodology for analyzing and improving Vision-Language Models applied to Visually-rich Document Understanding by exploring their visual embedding space. VERSE enables the visualization of latent representations, supporting the assessment of model feasibility. It also facilitates the identification of problematic regions and guides the generation of synthetic data to enhance performance in those clusters. We validate the methodology by training on the synthetic MERIT Dataset and evaluating on its real-world counterpart, MERIT Secret. Results show that VERSE helps uncover the visual features associated with error-prone clusters, and that retraining with samples containing these features substantially boosts F1 performance without degrading generalization. Furthermore, we demonstrate that on-premise models such as Donut and Idefics2, when optimized with VERSE, match or even surpass the performance of SaaS solutions like GPT-4 and Pixtral.
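To make the analysis loop concrete, below is a minimal sketch of a VERSE-style embedding-space inspection, under stated assumptions: `embeddings` stands in for visual embeddings already extracted from the VLM's vision encoder (one vector per document image), and `y_true` / `y_pred` are per-document labels and predictions used only to score clusters. The random arrays, cluster count, and projection method are illustrative placeholders, not the paper's exact pipeline.

```python
# Sketch of a VERSE-style analysis: project the visual embedding space,
# cluster it, and score each cluster to surface error-prone regions.
# All data below is synthetic stand-in data (hypothetical, for illustration).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 768))   # stand-in for real visual embeddings
y_true = rng.integers(0, 2, size=500)      # stand-in ground-truth labels
y_pred = rng.integers(0, 2, size=500)      # stand-in model predictions

# 1. Project the embedding space to 2-D for visualization.
coords = PCA(n_components=2).fit_transform(embeddings)

# 2. Cluster the projected space to delimit regions of the embedding space.
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(coords)

# 3. Score each cluster; low-F1 clusters mark the "problematic regions" whose
#    visual features would guide synthetic data generation and retraining.
for c in np.unique(labels):
    mask = labels == c
    f1 = f1_score(y_true[mask], y_pred[mask], average="micro")
    print(f"cluster {c}: n={mask.sum():3d}  F1={f1:.3f}")
```

In practice, the 2-D projection supports the visualization step, while the per-cluster scores identify where additional synthetic samples with matching visual features are most likely to help.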