In this paper, we introduce ARBEx, a novel attentive feature extraction framework driven by Vision Transformer (ViT) with reliability balancing to cope with poor class distributions, bias, and uncertainty in the facial expression learning (FEL) task. We employ several data pre-processing and refinement methods along with a window-based cross-attention ViT to make the most of the data. We also employ learnable anchor points in the embedding space, together with label distributions and a multi-head self-attention mechanism, to optimize performance against weak predictions through reliability balancing, a strategy that leverages anchor points, attention scores, and confidence values to enhance the resilience of label predictions. To ensure correct label classification and improve the model's discriminative power, we introduce an anchor loss, which encourages large margins between anchor points. Additionally, the multi-head self-attention mechanism, which is also trainable, plays an integral role in identifying accurate labels. This approach provides critical elements for improving the reliability of predictions and has a substantial positive effect on final prediction capabilities. Our adaptive model can be integrated with any deep neural network to forestall challenges in various recognition tasks. Extensive experiments conducted in a variety of contexts show that our strategy outperforms current state-of-the-art methods.
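The abstract states only that the anchor loss encourages large margins between anchor points; it does not give a formula. As a rough illustration of that idea, the sketch below penalizes every pair of anchors that lies closer than a chosen margin. The function name `anchor_margin_loss`, the hinge-style form, the Euclidean distance, and the `margin` parameter are all assumptions made for this example and should not be read as the loss actually used in ARBEx.

```python
import math
from itertools import combinations


def anchor_margin_loss(anchors, margin=1.0):
    """Hypothetical hinge-style penalty on anchor pairs closer than `margin`.

    `anchors` is a sequence of equal-length coordinate tuples standing in for
    learnable anchor points in an embedding space. This is an illustrative
    assumption, not the loss defined in the paper.
    """
    pairs = list(combinations(anchors, 2))
    if not pairs:
        # Fewer than two anchors: there is no pair to separate.
        return 0.0
    total = 0.0
    for a, b in pairs:
        dist = math.dist(a, b)  # Euclidean distance between the two anchors
        # Only pairs closer than the margin contribute to the penalty.
        total += max(0.0, margin - dist) ** 2
    return total / len(pairs)


# Toy 2-D anchors: one tightly clustered set, one well separated set.
close = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
spread = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]

print(anchor_margin_loss(close) > anchor_margin_loss(spread))
```

Under this sketch, minimizing the penalty pushes clustered anchors apart until every pair is at least `margin` apart, at which point the term vanishes. In a real training setup the anchors would be trainable parameters and this term would be combined with the classification objective.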