This work addresses the need to balance predictive performance and efficiency when deploying visually rich document understanding (VDU) models in scalable production environments. Current practice relies on large document foundation models, which offer advanced capabilities but carry a heavy computational burden. In this paper, we propose a multimodal early exit (EE) model design that combines several training strategies with different exit layer types and placements. Our goal is a Pareto-optimal balance between predictive performance and efficiency for multimodal document image classification. Through a comprehensive set of experiments, we compare our approach with traditional exit policies and show an improved performance-efficiency trade-off. Our multimodal EE design reduces latency by over 20% while fully retaining the baseline accuracy. This research is the first exploration of multimodal EE design within the VDU community, and it also highlights the effectiveness of calibration in improving the confidence scores used for exiting at different layers. Overall, our findings contribute to practical VDU applications by improving the balance between performance and efficiency.
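To make the idea of exiting on calibrated confidence concrete, here is a minimal sketch of one common early-exit policy. It is an illustration under stated assumptions, not the method of the paper: the abstract does not specify the exit rule or the calibration technique, so the sketch assumes a maximum-softmax-probability threshold and per-exit temperature scaling. The names `softmax`, `early_exit_predict`, `temperatures`, and `threshold` are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a temperature above 1 softens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(exit_logits, temperatures, threshold=0.9):
    """Return (predicted_class, exit_index) for one input.

    ASSUMPTION: `exit_logits` yields the logits of each exit head in depth
    order (it may be a lazy generator, so deeper layers are only computed
    when needed), and `temperatures` holds one calibration temperature per
    exit. Inference stops at the first exit whose calibrated confidence
    reaches `threshold`; otherwise the last exit's prediction is returned.
    """
    prediction, exit_index = None, -1
    for exit_index, (logits, temp) in enumerate(zip(exit_logits, temperatures)):
        probs = softmax(logits, temp)
        confidence = max(probs)
        prediction = probs.index(confidence)
        if confidence >= threshold:
            break  # confident enough: skip the remaining layers
    return prediction, exit_index


# Toy logits for three exits over three classes (made-up numbers).
toy_logits = [[1.0, 1.2, 0.9], [0.5, 4.0, 0.2], [0.1, 6.0, 0.0]]
print(early_exit_predict(toy_logits, [1.0, 1.0, 1.0]))
print(early_exit_predict(toy_logits, [1.0, 2.0, 1.0]))
```

In this toy example, raising the temperature of the second exit lowers its confidence below the threshold, so the input continues to the final exit. This is the sense in which calibration changes where a sample exits.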