Age and gender recognition in the wild is highly challenging: beyond variable conditions, complex poses, and uneven image quality, the face may be partially or completely occluded. We present MiVOLO (Multi Input VOLO), a straightforward approach to age and gender estimation built on the latest vision transformer. Our method integrates both tasks into a unified dual input/output model that leverages not only facial information but also the image of the person. This improves the model's generalization and lets it deliver satisfactory results even when the face is not visible in the image. We evaluate the model on four popular benchmarks and achieve state-of-the-art performance while demonstrating real-time processing capabilities. Additionally, we introduce a novel benchmark based on images from the Open Images Dataset. Its ground truth annotations were produced by human annotators, and a smart aggregation of their votes yields highly accurate labels. Furthermore, we compare our model's age recognition performance with human-level accuracy and show that it significantly outperforms humans across a majority of age ranges. Finally, we release our models publicly, along with the code for validation and inference, and provide extra annotations for the datasets we used together with our new benchmark.
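The abstract does not say how annotator votes are aggregated for the new benchmark. As a purely illustrative sketch, assuming each annotator carries a known reliability weight (a hypothetical quantity, as are the function names `aggregate_age_votes` and `aggregate_gender_votes`), a weighted mean for age and a weighted majority for gender might look like this. It is not presented as the authors' procedure.

```python
from collections import defaultdict
from typing import Sequence


def aggregate_age_votes(votes: Sequence[float], weights: Sequence[float]) -> float:
    """Combine age votes into one label using a weighted mean.

    The weights are assumed reliability scores, one per annotator;
    how such scores would be obtained is outside this sketch.
    """
    if not votes or len(votes) != len(weights):
        raise ValueError("votes and weights must be non-empty and of equal length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(v * w for v, w in zip(votes, weights)) / total


def aggregate_gender_votes(votes: Sequence[str], weights: Sequence[float]) -> str:
    """Pick the gender label with the largest summed annotator weight."""
    if not votes or len(votes) != len(weights):
        raise ValueError("votes and weights must be non-empty and of equal length")
    score = defaultdict(float)
    for label, w in zip(votes, weights):
        score[label] += w
    # max over labels by accumulated weight (weighted majority vote)
    return max(score, key=score.get)
```

With equal weights this reduces to a plain mean and a plain majority vote; unequal weights let more reliable annotators count for more.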