Beyond Specialization: Assessing the Capabilities of MLLMs in Age and Gender Estimation

Multimodal Large Language Models (MLLMs) have recently gained immense popularity. Powerful commercial models such as ChatGPT-4V and Gemini, as well as open-source ones such as LLaVA, are essentially general-purpose models and are applied to a wide variety of tasks, including those in computer vision. These neural networks possess such strong general knowledge and reasoning abilities that they have proven capable of handling even tasks for which they were not specifically trained. We compared the capabilities of the most powerful MLLMs to date (ShareGPT4V, ChatGPT, and LLaVA-Next) on the specialized task of age and gender estimation against our state-of-the-art specialized model, MiVOLO. We also updated MiVOLO and provide details and new metrics in this article. The comparison yielded some interesting results and insights into the strengths and weaknesses of the participating models. Furthermore, we tried various ways of fine-tuning the ShareGPT4V model for this specific task, aiming to achieve state-of-the-art results. Although such a model would not be practical in production, since it is far more expensive than a specialized model like MiVOLO, it could be very useful for some tasks, such as data annotation.

