ViT-FREE: Efficient Face Recognition via Early Exiting and Synthetic Adaptation

Vision Transformers (ViTs) have gained significant attention in computer vision and shown strong potential for face recognition (FR). However, their high computational cost makes deployment on resource-constrained devices challenging, motivating methods that balance efficiency and accuracy. In this work, we investigate early exiting in pretrained ViTs as a simple yet effective training-free strategy for efficient FR inference. Leveraging the uniform feature dimensionality across transformer encoder blocks, we introduce ViT-FREE, a multi-exit framework that performs face verification directly from intermediate representations without modifying or retraining the backbone model, thereby reducing inference cost. Empirically, we show that patch embeddings and attention maps evolve progressively with depth, exhibiting high similarity between consecutive ViT blocks and increasing alignment with the final representation. This indicates gradual feature refinement and attention convergence, suggesting that intermediate layers already provide stable and discriminative representations suitable for early exiting. Through extensive experiments on multiple FR benchmarks, we systematically analyze the accuracy-efficiency trade-off across exit depths. Our results demonstrate that later exits achieve a highly favorable balance: exiting at layer 10 yields up to a 20% speedup while incurring a drop of only 1.5 percentage points in verification performance on benchmarks such as IJB-C. We additionally propose ViT-FREE_FT, a lightweight exit-specific fine-tuning strategy that adapts only the projection layers using a small synthetic dataset while keeping the transformer backbone frozen. This approach improves the performance of shallow exits while preserving the efficiency benefits and leaving deeper exits largely unaffected.
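
The early-exit idea described above can be illustrated with a minimal, self-contained sketch. It is not the authors' implementation: the toy residual block stands in for a real transformer encoder block, and the names `make_block`, `embed`, `verify`, and `relative_block_cost`, the depth of 12, the feature size of 16, and the 0.5 threshold are all hypothetical choices made for illustration. The sketch also omits the projection layers mentioned in the abstract and simply L2-normalises the intermediate feature. What it does show is why uniform dimensionality matters: every exit produces a vector of the same size, so the same cosine-similarity comparison can be applied at any depth.

```python
import math
import random

DIM = 16     # hypothetical feature size, identical at every block
DEPTH = 12   # hypothetical number of encoder blocks (an assumption)


def make_block(dim, rng):
    """Toy stand-in for an encoder block: x -> x + 0.1 * tanh(W x)."""
    scale = 1.0 / math.sqrt(dim)
    w = [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]

    def block(x):
        h = [math.tanh(sum(wij * xj for wij, xj in zip(row, x))) for row in w]
        # Residual update keeps the output dimension equal to the input's.
        return [xi + 0.1 * hi for xi, hi in zip(x, h)]

    return block


def embed(x, blocks, exit_depth):
    """Run only the first `exit_depth` blocks; return an L2-normalised feature."""
    if not 1 <= exit_depth <= len(blocks):
        raise ValueError("exit_depth must be between 1 and len(blocks)")
    for block in blocks[:exit_depth]:
        x = block(x)
    norm = math.sqrt(sum(v * v for v in x)) or 1.0
    return [v / norm for v in x]


def verify(x1, x2, blocks, exit_depth, threshold=0.5):
    """Cosine-similarity verification using features taken at `exit_depth`."""
    f1 = embed(x1, blocks, exit_depth)
    f2 = embed(x2, blocks, exit_depth)
    score = sum(a * b for a, b in zip(f1, f2))  # both features are unit-norm
    return score, score >= threshold


def relative_block_cost(exit_depth, depth):
    """Fraction of encoder blocks executed, assuming equal cost per block."""
    return exit_depth / depth


# The frozen "backbone": weights are fixed once and never updated.
_rng = random.Random(0)
blocks = [make_block(DIM, _rng) for _ in range(DEPTH)]

if __name__ == "__main__":
    data_rng = random.Random(1)
    face = [data_rng.gauss(0.0, 1.0) for _ in range(DIM)]
    # A second "image" of the same identity: the same vector plus small noise.
    noisy = [v + data_rng.gauss(0.0, 0.1) for v in face]
    for depth in (10, DEPTH):
        score, match = verify(face, noisy, blocks, depth)
        cost = relative_block_cost(depth, DEPTH)
        print(f"exit={depth} score={score:.3f} match={match} cost={cost:.2f}")
```

Under the equal-cost-per-block assumption, exiting at block 10 of a 12-block stack executes 10/12 of the encoder. Measured speedups depend on the actual backbone depth, the patch embedding, and the exit head, none of which are modeled here.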
