Current multimodal large language models (MLLMs) face significant challenges in visual document understanding (VDU) tasks due to the high resolution, dense text, and complex layouts typical of document images, all of which demand strong fine-grained perception from MLLMs. While increasing the input resolution improves detail perception, it also lengthens the visual token sequence, raising computational costs and straining the models' ability to handle long contexts. To address these challenges, we introduce DocKylin, a document-centric MLLM that performs visual content slimming at both the pixel and token levels, thereby shortening the token sequence in VDU scenarios. DocKylin uses an Adaptive Pixel Slimming (APS) preprocessing module to perform pixel-level slimming, increasing the proportion of informative pixels. Moreover, DocKylin incorporates a novel Dynamic Token Slimming (DTS) module to conduct token-level slimming, retaining essential tokens and discarding the rest to produce a compressed, adaptive visual sequence. Experiments demonstrate DocKylin's promising performance across various VDU benchmarks. Notably, both the proposed APS and DTS are parameter-free, facilitating easy integration into existing MLLMs, and our experiments indicate their potential for broader applications.
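To make the pixel-level idea concrete, the sketch below shows one plausible, parameter-free (no learned weights) realization of APS-style preprocessing: rows and columns of a document image whose gradient energy is near zero are treated as uninformative background and removed, raising the proportion of informative pixels. The gradient heuristic and the `threshold` value are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def adaptive_pixel_slimming(image, threshold=0.05):
    """Remove near-blank rows and columns from a document image.

    Rows/columns whose mean gradient magnitude falls below a fraction
    (`threshold`) of the image's overall mean gradient are assumed to
    carry no content and are dropped. Both the scoring rule and the
    threshold are assumptions for illustration.
    """
    gray = image.mean(axis=2) if image.ndim == 3 else image.astype(float)
    # Vertical and horizontal intensity gradients (padded to keep shape).
    gy = np.abs(np.diff(gray, axis=0, prepend=gray[:1]))
    gx = np.abs(np.diff(gray, axis=1, prepend=gray[:, :1]))
    grad = gx + gy
    # Keep only rows/columns carrying enough gradient energy (i.e., content).
    keep_rows = grad.mean(axis=1) > threshold * grad.mean()
    keep_cols = grad.mean(axis=0) > threshold * grad.mean()
    return image[keep_rows][:, keep_cols]
```

On a mostly blank page with a small text region, this crops the image down to roughly the bounding rows and columns of the content, so the downstream visual encoder spends its token budget on informative pixels.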
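The token-level idea can be sketched similarly. The snippet below is a minimal, parameter-free illustration in the spirit of DTS: visual tokens are scored by their distance from the mean token (background patches in documents are near-identical, hence close to the mean), and a simple two-cluster split on the 1-D scores adaptively decides which tokens to keep. The scoring rule and the clustering choice are assumptions, not the paper's exact procedure.

```python
import numpy as np

def dynamic_token_slimming(tokens):
    """Keep the 'essential' subset of a (N, D) visual token array.

    Tokens far from the mean token are assumed informative; a 2-means
    split on the 1-D distance scores yields an adaptive cutoff, so the
    number of retained tokens varies per image. All choices here are
    illustrative assumptions; no learned parameters are involved.
    """
    scores = np.linalg.norm(tokens - tokens.mean(axis=0), axis=1)
    # Simple 2-means on the 1-D scores to find an adaptive cutoff.
    lo, hi = scores.min(), scores.max()
    for _ in range(20):
        cut = (lo + hi) / 2.0
        low, high = scores[scores <= cut], scores[scores > cut]
        if len(low) == 0 or len(high) == 0:
            break
        lo, hi = low.mean(), high.mean()
    keep = scores > (lo + hi) / 2.0  # distinctive tokens are retained
    return tokens[keep]
```

Because the cutoff is derived from the score distribution itself rather than a fixed keep-ratio, the resulting sequence length adapts to how much of the image is actual content.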