In set-based face recognition, we aim to compute the most discriminative descriptor from an unbounded set of images and videos showing a single person. A discriminative descriptor balances two policies when aggregating information from a given set. The first is a quality-based policy: emphasize high-quality images and down-weight low-quality ones. The second is a diversity-based policy: emphasize unique images in the set and down-weight repeated occurrences of similar images, as found in video clips, which can otherwise overwhelm the set representation. This work frames face-set representation as a differentiable coreset selection problem. Our model learns to select a small coreset of the input set that balances the quality and diversity policies, using a learned metric that is parameterized by face quality and optimized end-to-end. The selection process is a differentiable farthest-point sampling (FPS), realized by approximating the non-differentiable argmax operation with differentiable sampling from the Gumbel-Softmax distribution of distances. The small coreset is then used as queries in a self- and cross-attention architecture to enrich the descriptor with information from the whole set. Our model is order-invariant and linear in the input set size. We set a new state of the art (SOTA) for set-based face verification on the IJB-B and IJB-C datasets. Our code is publicly available.
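To make the selection step concrete, the following is a minimal forward-pass sketch of farthest-point sampling in which the hard argmax is replaced by a Gumbel-Softmax sample over distances. It is an illustration under stated assumptions, not the authors' implementation: the names `gumbel_softmax` and `soft_fps`, the plain squared Euclidean distance (standing in for the learned, quality-parameterized metric), the temperature `tau`, and the choice to seed the coreset with the first point are all hypothetical. It uses only the Python standard library, so it shows the relaxation itself rather than gradient flow; an autodiff framework would be needed to train through it.

```python
import math
import random


def gumbel_softmax(logits, tau, rng):
    """Return soft one-hot weights sampled from a Gumbel-Softmax distribution."""
    noisy = []
    for x in logits:
        # Clip u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        g = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((x + g) / tau)
    m = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def sq_dist(a, b):
    """Squared Euclidean distance (assumed stand-in for the learned metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def soft_fps(points, k, tau=0.1, seed=0):
    """Relaxed FPS: pick k soft points, each a convex combination of inputs."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Assumption: seed the coreset with the first point. This choice depends
    # on input order and is made here only to keep the sketch short.
    selected = [list(points[0])]
    weights = [[1.0] + [0.0] * (n - 1)]
    # min_dist[i]: squared distance from point i to its nearest selected point.
    min_dist = [sq_dist(p, selected[0]) for p in points]
    for _ in range(k - 1):
        # Soft replacement for argmax over the distance-to-coreset values.
        w = gumbel_softmax(min_dist, tau, rng)
        new = [sum(w[i] * points[i][d] for i in range(n)) for d in range(dim)]
        selected.append(new)
        weights.append(w)
        min_dist = [min(min_dist[i], sq_dist(points[i], new)) for i in range(n)]
    return selected, weights


# Three near-duplicate points (like video frames) plus two distinct ones.
pts = [[0.0, 0.0], [0.01, 0.0], [0.0, 0.01], [10.0, 0.0], [0.0, 10.0]]
coreset, step_weights = soft_fps(pts, k=3, tau=0.1, seed=0)
```

In this toy set the near-duplicates all lie close to the seed point, so their selection weights at the next step are negligible and the sample concentrates on the distant points. This mirrors the diversity policy described above. Lowering `tau` pushes the weights toward a hard one-hot choice, while raising it yields smoother mixtures.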