Vision transformers in vision-language models typically use the same amount of compute for every image, regardless of whether it is simple or complex. We propose ICAR (Image Complexity-Aware Retrieval), an adaptive computation approach that enables vision transformers to use less compute for simple images whilst processing complex images through their full network depth. The key challenge is maintaining cross-modal alignment: embeddings from different processing depths must remain compatible for text matching. ICAR solves this through dual-path training, which produces compatible embeddings from both the early-exit and full-depth paths, keeping image representations and text embeddings in the same semantic space whether an image exits early or is processed fully. Unlike existing two-stage approaches that require expensive reranking, ICAR enables direct image-text matching without additional overhead. To determine how much compute to use, we develop ConvNeXt-IC, which treats image complexity assessment as a classification task. By applying modern classifier backbones rather than specialised architectures, ConvNeXt-IC achieves state-of-the-art performance, attaining a Pearson correlation of 0.959 with human labelling whilst delivering 4.4x faster complexity prediction. Evaluated on standard benchmarks augmented with real-world web data, ICAR achieves 20% faster image encoding whilst maintaining category-level performance and 95% of instance-level performance, enabling sustainable scaling of vision-language systems.
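The complexity-gated early-exit idea can be illustrated with a minimal, dependency-free sketch. This is an assumption-laden toy, not the paper's implementation: the exit depth, the 0-to-1 complexity score standing in for a ConvNeXt-IC prediction, and the threshold value are all illustrative names chosen here.

```python
# Toy sketch of complexity-gated early exit (illustrative values only;
# the paper's actual exit depth, scorer, and threshold may differ).

EARLY_EXIT_DEPTH = 6        # blocks run when a simple image exits early (assumed)
FULL_DEPTH = 12             # total transformer blocks (assumed)
COMPLEXITY_THRESHOLD = 0.5  # complexity score above which full depth is used (assumed)

def encode_image(complexity: float) -> dict:
    """Pick a processing path for one image.

    `complexity` stands in for a ConvNeXt-IC-style prediction in [0, 1];
    dual-path training is what makes either path's embedding usable for
    text matching, so no reranking stage is needed afterwards.
    """
    if complexity < COMPLEXITY_THRESHOLD:
        return {"path": "early", "blocks_run": EARLY_EXIT_DEPTH}
    return {"path": "full", "blocks_run": FULL_DEPTH}

# Average compute over a mixed batch: simple images cut encoding cost.
batch = [0.2, 0.9, 0.4, 0.7]
mean_blocks = sum(encode_image(c)["blocks_run"] for c in batch) / len(batch)
```

On this toy batch, two of four images exit early, so the mean falls to 9 blocks instead of 12, which is the mechanism behind the reported encoding speed-up.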