Recently, owing to advances in multimodal technology, visual large language models (VLLMs) are being applied in industrial production, and many deep learning models (DLMs) deployed in production environments are gradually being replaced by VLLMs. Compared with DLMs, VLLMs offer several advantages in industrial applications: (1) their strong generalization ability lets them perform well across a wide range of tasks, and (2) they are flexible and can handle unfamiliar samples quickly through in-context learning. However, VLLMs also have clear drawbacks: (1) they do not perform as well as custom-developed DLMs in specific domains; (2) their parameter counts are generally large, so deployment requires substantial computational resources; and (3) they generally run much slower than DLMs, making real-time response difficult to achieve. To better utilize VLLMs in industrial applications, we introduce AuroraEdge-V-2B, a compact, robust, and fast VLLM designed for edge deployment. To make the model run faster, we also propose a compression-fusion method that improves inference efficiency. AuroraEdge-V-2B has the following notable features: (1) Easy to deploy and fast: with only 2B parameters, it is well suited to edge deployment and offers better real-time performance. (2) Fewer visual tokens, lower cost: it significantly reduces the number of visual tokens in the decoding process, halving the floating-point operations during inference and making it cheaper to run. (3) Strong performance: it scores higher on 9 benchmarks than models with a comparable number of parameters (e.g., Qwen2-VL-2B, Qwen2.5-VL-3B, InternVL-2.5-2B).
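To make the token-reduction claim concrete, here is a minimal sketch of one common way a visual-token count can be shrunk before decoding. The specifics of the proposed compression-fusion method are not given in this passage, so the function below (`compress_visual_tokens`, a hypothetical name) simply fuses each 2x2 neighborhood of patch tokens by average pooling, cutting the token count 4x; since decoder cost grows with sequence length, fewer visual tokens directly reduces inference FLOPs.

```python
import numpy as np

def compress_visual_tokens(tokens: np.ndarray, grid: int) -> np.ndarray:
    """Fuse each 2x2 neighborhood of a (grid*grid, d) patch-token map
    into one token by average pooling, yielding ((grid/2)**2, d) tokens.

    Note: this is an illustrative stand-in, not the paper's actual
    compression-fusion method, which is not specified in this excerpt.
    """
    d = tokens.shape[1]
    t = tokens.reshape(grid, grid, d)
    # Split the spatial grid into 2x2 blocks and average over each block.
    t = t.reshape(grid // 2, 2, grid // 2, 2, d).mean(axis=(1, 3))
    return t.reshape(-1, d)

if __name__ == "__main__":
    vis = np.random.rand(24 * 24, 64)  # e.g. 576 visual tokens of dim 64
    fused = compress_visual_tokens(vis, grid=24)
    print(fused.shape)  # (144, 64): 4x fewer visual tokens enter the decoder
```

Average pooling is only one fusion choice; learned mergers (e.g., a small MLP over each block) trade a few extra parameters for less information loss.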