Out-of-the-Box Age Estimation Through Facial Imagery: A Comprehensive Benchmark of Vision-Language Models vs. Out-of-the-Box Traditional Architectures

Simiao Ren, Xingyu Shen, Ankit Raj, Albert Dai, Caroline Zhang, Yuan Xu, Zexi Chen, Siqi Wu, Chen Gong, Yuxin Zhang

Facial age estimation plays a critical role in content moderation, age verification, and deepfake detection. However, no prior benchmark has systematically compared modern vision-language models (VLMs) with specialized age estimation architectures. We present the first large-scale cross-paradigm benchmark, evaluating 34 models (22 specialized architectures with publicly available pretrained weights and 12 general-purpose VLMs) across eight standard datasets (UTKFace, IMDB-WIKI, MORPH, AFAD, CACD, FG-NET, APPA-REAL, and AgeDB), totaling 1,100 test images per model. Our key finding is that zero-shot VLMs significantly outperform most specialized models, achieving an average mean absolute error (MAE) of 5.65 years compared to 9.88 years for non-LLM models. The best-performing VLM (Gemini 3 Flash Preview, MAE 4.32) achieves a 15% lower MAE than the strongest non-LLM model (MiVOLO, MAE 5.10). MiVOLO, which is unique in combining face and body features using Vision Transformers, is the only specialized model that remains competitive with VLMs. We further analyze age verification at the 18-year threshold and find that most non-LLM models exhibit false adult rates between 39% and 100% for minors, whereas VLMs reduce this to between 16% and 29%. Additionally, models that use coarse age binning (8 to 9 classes) consistently show MAE above 13 years. Stratified analysis across 14 age groups reveals that all models struggle most at extreme ages (under 5 and over 65). Overall, these findings challenge the assumption that task-specific architectures are necessary for high-performance age estimation and suggest that future work should focus on distilling VLM capabilities into efficient specialized models.
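
The abstract reports three kinds of numbers: MAE in years, a false adult rate at the 18-year threshold, and a relative MAE improvement. The following is a minimal sketch of how such metrics could be computed. It is not the authors' evaluation code. The function names are hypothetical, and the false adult rate definition used here (the fraction of true minors whose predicted age is 18 or above) is an assumption, since the abstract does not spell it out.

```python
from typing import Sequence


def mean_absolute_error(true_ages: Sequence[float],
                        predicted_ages: Sequence[float]) -> float:
    """Average absolute difference, in years, between predicted and true ages."""
    if len(true_ages) == 0 or len(true_ages) != len(predicted_ages):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - t) for t, p in zip(true_ages, predicted_ages))
    return total / len(true_ages)


def false_adult_rate(true_ages: Sequence[float],
                     predicted_ages: Sequence[float],
                     threshold: float = 18.0) -> float:
    """Fraction of true minors (age < threshold) predicted at or above threshold.

    Assumed definition; the paper may define this rate differently.
    """
    minor_predictions = [p for t, p in zip(true_ages, predicted_ages)
                         if t < threshold]
    if not minor_predictions:
        raise ValueError("no minors in the sample")
    flagged = sum(1 for p in minor_predictions if p >= threshold)
    return flagged / len(minor_predictions)


def relative_improvement(baseline_mae: float, improved_mae: float) -> float:
    """Relative reduction in MAE with respect to the baseline."""
    return (baseline_mae - improved_mae) / baseline_mae


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the benchmark.
    true_ages = [10, 16, 17, 25, 40]
    predicted_ages = [12, 19, 15, 22, 47]
    print(mean_absolute_error(true_ages, predicted_ages))
    print(false_adult_rate(true_ages, predicted_ages))
    # MAE values quoted in the abstract: MiVOLO 5.10, Gemini 3 Flash Preview 4.32.
    print(relative_improvement(5.10, 4.32))
```

With the MAE values quoted in the abstract, `relative_improvement(5.10, 4.32)` evaluates to roughly 0.153, which is consistent with the reported 15% figure.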
