Out-of-the-Box Age Estimation Through Facial Imagery: A Comprehensive Benchmark of Vision-Language Models vs. Out-of-the-Box Traditional Architectures

Simiao Ren

Facial age estimation is critical for content moderation, age verification, and deepfake detection, yet no prior benchmark has systematically compared modern vision-language models (VLMs) against specialized age estimation architectures. We present the first large-scale cross-paradigm benchmark, evaluating \textbf{34 models} (22 specialized architectures with publicly available pretrained weights and 12 general-purpose VLMs) across \textbf{8 standard datasets} (UTKFace, IMDB-WIKI, MORPH, AFAD, CACD, FG-NET, APPA-REAL, AgeDB) totaling 1{,}100 test images per model. Our key finding is that \emph{zero-shot VLMs significantly outperform most specialized models}, achieving an average mean absolute error (MAE) of 5.65 years compared to 9.88 years for non-LLM models. The best VLM (Gemini~3 Flash Preview, MAE~4.32) achieves a 15\% lower MAE than the best non-LLM model (MiVOLO, MAE~5.10). Only MiVOLO, which uniquely combines face and body features via Vision Transformers, is competitive with VLMs. We further analyze age verification at the 18-year threshold, showing that non-LLM models misclassify 60--100\% of minors as adults (the false adult rate), whereas VLMs do so for 13--25\%, and demonstrate that coarse age binning (8--9 classes) consistently pushes MAE above 13 years. Our stratified analysis across 14 age groups shows that all models struggle most at the extreme ages ($<$5 and 65+). These findings challenge the assumption that task-specific architectures are necessary for age estimation and suggest that the field should redirect its efforts toward distilling VLM capabilities into efficient specialized models.
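
The metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names are hypothetical, and the convention that a prediction of 18 or more counts as "adult" is an assumption, since the abstract states only the threshold itself.

```python
from typing import Sequence

# Verification threshold named in the abstract. Treating a prediction
# >= 18 as "adult" is an assumption of this sketch.
ADULT_THRESHOLD = 18


def mean_absolute_error(true_ages: Sequence[float], pred_ages: Sequence[float]) -> float:
    """Average absolute difference, in years, between true and predicted ages."""
    if len(true_ages) != len(pred_ages) or len(true_ages) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(true_ages, pred_ages)) / len(true_ages)


def false_adult_rate(
    true_ages: Sequence[float],
    pred_ages: Sequence[float],
    threshold: float = ADULT_THRESHOLD,
) -> float:
    """Fraction of true minors (age < threshold) predicted at or above the threshold."""
    minors = [(t, p) for t, p in zip(true_ages, pred_ages) if t < threshold]
    if not minors:
        raise ValueError("no minors in the sample")
    return sum(1 for _, p in minors if p >= threshold) / len(minors)


def relative_improvement(baseline_mae: float, candidate_mae: float) -> float:
    """Fractional MAE reduction of the candidate relative to the baseline."""
    return (baseline_mae - candidate_mae) / baseline_mae


# Using the two MAE values reported in the abstract (MiVOLO 5.10, best VLM 4.32):
gap = relative_improvement(5.10, 4.32)  # about 0.153, i.e. the reported 15%
```

Under this definition the false adult rate is computed only over images whose true age is below the threshold, so it is independent of how many adults are in the test set.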
