Multimodal Large Language Models (MLLMs) have recently gained immense popularity. Powerful commercial models such as ChatGPT-4V and Gemini, as well as open-source ones such as LLaVA, are essentially general-purpose models and are applied to a wide variety of tasks, including those in computer vision. These neural networks possess such strong general knowledge and reasoning abilities that they have proven capable of handling even tasks for which they were not specifically trained. We compared the capabilities of the most powerful MLLMs to date (ShareGPT4V, ChatGPT, and LLaVA-Next) with our state-of-the-art specialized model, MiVOLO, on the specialized task of age and gender estimation. We also updated MiVOLO and provide details and new metrics in this article. The comparison yielded interesting results and insights into the strengths and weaknesses of the participating models. Furthermore, we tried several ways of fine-tuning the ShareGPT4V model for this specific task, aiming to achieve state-of-the-art results. Although such a model would not be practical in production, as it is far more expensive than a specialized model like MiVOLO, it could be very useful for some tasks, such as data annotation.