Age and gender recognition in the wild is highly challenging: beyond variable conditions, complex poses, and varying image quality, the face may be partially or completely occluded. We present MiVOLO (Multi Input VOLO), a straightforward approach to age and gender estimation built on a recent vision transformer. Our method integrates both tasks into a unified dual input/output model that leverages not only facial information but also person image data. This improves the model's generalization and enables it to deliver satisfactory results even when the face is not visible in the image. We evaluate the model on four popular benchmarks and achieve state-of-the-art performance while demonstrating real-time processing capabilities. Additionally, we introduce a novel benchmark based on images from the Open Images Dataset. Its ground truth annotations were generated by human annotators, and smart aggregation of their votes yields highly accurate labels. Furthermore, we compare our model's age recognition performance with human-level accuracy and demonstrate that it significantly outperforms humans across a majority of age ranges. Finally, we publicly release our models, the code for validation and inference, extra annotations for the datasets we used, and our new benchmark.