Current large vision-language models (VLMs) often suffer from two problems: the limited capability of a single visual component and excessively long visual token sequences. These issues limit a model's ability to interpret complex visual information accurately and to handle overly long contexts. Addressing them is crucial for improving the performance and applicability of VLMs. This paper proposes an ensemble-of-experts technique that synergizes the capabilities of individual visual encoders, including those skilled in image-text matching, OCR, image segmentation, etc. The technique introduces a fusion network that unifies the processing of outputs from different visual experts while bridging the gap between image encoders and pre-trained LLMs. In addition, we explore different positional encoding schemes to reduce the positions wasted on lengthy image feature sequences, which addresses position overflow and length limitations. For instance, in our implementation, this reduces the positional occupancy of models such as SAM from 4096 to 64, or even to 1. Experimental results demonstrate that VLMs with multiple experts consistently outperform isolated visual encoders, and that performance improves significantly as more experts are integrated. We have open-sourced the training code used in this report. All of these resources can be found on our project website.
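To make the positional-occupancy figures concrete, the following is a minimal sketch of one way a sequence of image tokens could share position indices so that it occupies fewer positions. It is an illustration under stated assumptions, not the scheme implemented in this report: the function name `shared_position_ids`, its parameters, and the contiguous grouping rule are all hypothetical.

```python
def shared_position_ids(num_text_before, num_image_tokens, num_text_after, occupancy):
    """Build position ids in which image tokens use only `occupancy` positions.

    Hypothetical illustration: image tokens are split into `occupancy`
    contiguous groups, and every token in a group shares one position id.
    Text tokens before and after the image keep ordinary consecutive ids.
    """
    if occupancy < 1 or occupancy > num_image_tokens:
        raise ValueError("occupancy must be between 1 and num_image_tokens")
    if num_image_tokens % occupancy != 0:
        raise ValueError("num_image_tokens must be divisible by occupancy")

    group_size = num_image_tokens // occupancy

    # Text preceding the image: positions 0 .. num_text_before - 1.
    ids = list(range(num_text_before))

    # Image tokens: each group of `group_size` tokens shares one position.
    image_start = num_text_before
    ids += [image_start + i // group_size for i in range(num_image_tokens)]

    # Text following the image resumes right after the occupied positions.
    text_start = image_start + occupancy
    ids += list(range(text_start, text_start + num_text_after))
    return ids


# 4096 image tokens occupying 64 positions, then a single position.
ids_64 = shared_position_ids(2, 4096, 3, occupancy=64)
ids_1 = shared_position_ids(2, 4096, 3, occupancy=1)
print(len(ids_64), max(ids_64))
print(len(ids_1), max(ids_1))
```

In this sketch the sequence length is unchanged, since all 4096 image tokens are still fed to the model. Only the range of position indices they consume shrinks, which leaves more of the position budget for text.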