HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models

High-resolution inputs enable Large Vision-Language Models (LVLMs) to discern finer visual details, enhancing their comprehension capabilities. To reduce the training and computation costs caused by high-resolution input, one promising direction is to use sliding windows to slice the input into uniform patches, each matching the input size of the well-trained vision encoder. Although efficient, this slicing strategy fragments the original input, i.e., the continuity of contextual information and spatial geometry is lost across patches, which adversely affects performance on cross-patch context perception and position-specific tasks. To overcome these shortcomings, we introduce HiRes-LLaVA, a novel framework designed to efficiently process high-resolution input of any size without altering the original contextual and geometric information. HiRes-LLaVA comprises two innovative components: (i) a SliceRestore adapter that reconstructs sliced patches into their original form, efficiently extracting both global and local features via down-up-sampling and convolution layers, and (ii) a Self-Mining Sampler that compresses the vision tokens based on themselves, preserving the original context and positional information while reducing training overhead. To assess the ability to handle context fragmentation, we construct a new benchmark, EntityGrid-QA, consisting of edge-related and position-related tasks. Our comprehensive experiments demonstrate the superiority of HiRes-LLaVA on both existing public benchmarks and EntityGrid-QA, particularly on document-oriented tasks, establishing new standards for handling high-resolution inputs.
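The slicing and fragmentation described above can be illustrated with a minimal sketch. It is not the HiRes-LLaVA implementation: the function names `slice_into_patches` and `restore_from_patches` are hypothetical, the "image" is a toy 4×4 grid of integers, and the sketch assumes the image dimensions are divisible by the patch size. It only shows how uniform slicing separates neighbouring pixels into different patches, and how stitching the patches back recovers the original layout, which is the layout the SliceRestore adapter is said to reconstruct before extracting features.

```python
from typing import List

Grid = List[List[int]]


def slice_into_patches(image: Grid, patch: int) -> List[List[Grid]]:
    """Split an H x W grid into a (H/patch) x (W/patch) array of patch x patch tiles."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        # Assumption of this sketch: no padding or resizing is performed.
        raise ValueError("dimensions must be divisible by the patch size")
    return [
        [
            [row[left:left + patch] for row in image[top:top + patch]]
            for left in range(0, width, patch)
        ]
        for top in range(0, height, patch)
    ]


def restore_from_patches(patches: List[List[Grid]]) -> Grid:
    """Stitch tiles back into one grid, recovering the original spatial layout."""
    restored: Grid = []
    for patch_row in patches:
        patch_size = len(patch_row[0])
        for r in range(patch_size):
            line: List[int] = []
            for tile in patch_row:
                line.extend(tile[r])  # concatenate row r of each tile, left to right
            restored.append(line)
    return restored


# Toy 4 x 4 "image" whose pixel values encode their row-major position.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = slice_into_patches(image, 2)
restored = restore_from_patches(patches)
```

In this toy case, pixels 1 and 2 are horizontal neighbours in the image but land in different patches, so an encoder that sees each patch in isolation loses their adjacency; stitching the patches back together restores it.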
