Face Image Quality Assessment (FIQA) is essential for reliable face recognition systems. Current approaches primarily exploit final-layer representations, while existing training-free methods require multiple forward passes or backpropagation. We propose ViTNT-FIQA, a training-free approach that measures the stability of patch embedding evolution across intermediate Vision Transformer (ViT) blocks. We demonstrate that high-quality face images exhibit stable feature refinement trajectories across blocks, whereas degraded images show erratic transformations. Our method computes Euclidean distances between L2-normalized patch embeddings from consecutive transformer blocks and aggregates them into image-level quality scores. We empirically validate this correlation on a quality-labeled synthetic dataset with controlled degradation levels. Unlike existing training-free approaches, ViTNT-FIQA requires only a single forward pass, without backpropagation or architectural modifications. Through extensive evaluation on eight benchmarks (LFW, AgeDB-30, CFP-FP, CALFW, Adience, CPLFW, XQLFW, IJB-C), we show that ViTNT-FIQA achieves performance competitive with state-of-the-art methods while maintaining computational efficiency and immediate applicability to any pre-trained ViT-based face recognition model.
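The scoring rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes access to the per-block patch embeddings (e.g., via forward hooks on a pre-trained ViT), and it assumes a simple mean over patches and block transitions as the aggregation step, with the sign flipped so that more stable trajectories yield higher quality scores.

```python
import numpy as np

def vitnt_fiqa_score(block_embeddings: np.ndarray) -> float:
    """Image-level quality score from intermediate ViT patch embeddings.

    block_embeddings: shape (num_blocks, num_patches, dim), the patch
    embeddings produced by each intermediate transformer block.
    Aggregation by a plain mean is an illustrative assumption.
    """
    # L2-normalize each patch embedding within every block.
    norms = np.linalg.norm(block_embeddings, axis=-1, keepdims=True)
    normed = block_embeddings / np.clip(norms, 1e-12, None)
    # Euclidean distance between corresponding patch embeddings
    # in consecutive blocks: shape (num_blocks - 1, num_patches).
    diffs = np.linalg.norm(normed[1:] - normed[:-1], axis=-1)
    # Stable refinement trajectories (small consecutive-block distances)
    # should score higher, so negate the mean instability.
    return -float(diffs.mean())
```

A perfectly stable trajectory (identical embeddings at every block) scores 0, the maximum; erratic transformations across blocks push the score further negative.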