Facial age estimation plays a critical role in content moderation, age verification, and deepfake detection. However, no prior benchmark has systematically compared modern vision-language models (VLMs) with specialized age estimation architectures. We present the first large-scale cross-paradigm benchmark, evaluating 34 models (22 specialized architectures with publicly available pretrained weights and 12 general-purpose VLMs) across eight standard datasets (UTKFace, IMDB-WIKI, MORPH, AFAD, CACD, FG-NET, APPA-REAL, and AgeDB), totaling 1,100 test images per model. Our key finding is that zero-shot VLMs substantially outperform most specialized models, achieving an average mean absolute error (MAE) of 5.65 years compared to 9.88 years for non-LLM models. The best-performing VLM (Gemini 3 Flash Preview, MAE 4.32) improves on the strongest non-LLM model (MiVOLO, MAE 5.10) by 15% in relative MAE. MiVOLO, which is unique in combining face and body features using Vision Transformers, is the only specialized model that remains competitive with VLMs. We further analyze age verification at the 18-year threshold and find that most non-LLM models exhibit false adult rates (the share of minors classified as adults) between 39% and 100%, whereas VLMs reduce this to 16%-29%. Additionally, coarse age binning (8-9 classes) consistently pushes MAE above 13 years. Stratified analysis across 14 age groups reveals that all models struggle most at extreme ages (under 5 and over 65). Overall, these findings challenge the assumption that task-specific architectures are necessary for high-performance age estimation and suggest that future work should focus on distilling VLM capabilities into efficient specialized models.
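The three quantities reported above (MAE, the false adult rate at the 18-year threshold, and the relative MAE improvement) can be illustrated with a minimal sketch. The function names and the toy ages are hypothetical and are not taken from the benchmark. The sketch also assumes that a minor counts as a "false adult" when the predicted age is at or above the threshold, which the abstract does not spell out.

```python
from typing import Sequence

ADULT_THRESHOLD = 18  # years; the verification threshold named in the abstract


def mean_absolute_error(true_ages: Sequence[float], pred_ages: Sequence[float]) -> float:
    """Average of |predicted - true| over all test images, in years."""
    if len(true_ages) == 0 or len(true_ages) != len(pred_ages):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for t, p in zip(true_ages, pred_ages)) / len(true_ages)


def false_adult_rate(
    true_ages: Sequence[float],
    pred_ages: Sequence[float],
    threshold: float = ADULT_THRESHOLD,
) -> float:
    """Fraction of true minors (age < threshold) predicted at or above the threshold.

    Assumption: this is one plausible reading of "false adult rate";
    the benchmark's exact definition may differ.
    """
    minors = [(t, p) for t, p in zip(true_ages, pred_ages) if t < threshold]
    if not minors:
        raise ValueError("no minors in the sample")
    return sum(1 for _, p in minors if p >= threshold) / len(minors)


def relative_mae_reduction(baseline_mae: float, candidate_mae: float) -> float:
    """Relative improvement of candidate over baseline, as a fraction of baseline MAE."""
    return (baseline_mae - candidate_mae) / baseline_mae


if __name__ == "__main__":
    # Toy data for illustration only; these are not benchmark results.
    true_ages = [4, 12, 16, 17, 25, 40, 70]
    pred_ages = [7, 15, 19, 17, 22, 44, 61]

    print(f"MAE: {mean_absolute_error(true_ages, pred_ages):.2f} years")
    print(f"False adult rate: {false_adult_rate(true_ages, pred_ages):.0%}")

    # MAE values quoted in the abstract: MiVOLO 5.10, Gemini 3 Flash Preview 4.32.
    print(f"Relative MAE reduction: {relative_mae_reduction(5.10, 4.32):.1%}")
```

With the MAE values quoted in the abstract, the last line evaluates (5.10 - 4.32) / 5.10, which is about 15.3% and matches the reported 15% figure.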