Towards Zero-Shot Interpretable Human Recognition: A 2D-3D Registration Framework

Large vision models based on deep learning architectures have consistently advanced the state of the art in biometric recognition. However, three weaknesses are commonly reported for such approaches: 1) their extreme demand for training data; 2) the difficulty of generalizing across domains; and 3) the lack of interpretability/explainability, which is of particular concern in biometrics, where the evidence provided must be usable for forensic/legal purposes (e.g., in court). To the best of our knowledge, this paper describes the first recognition framework/strategy that aims to address the three weaknesses simultaneously. First, it relies exclusively on synthetic samples for learning purposes. Instead of requiring a large number and variety of samples for each subject, the idea is to enroll only a single 3D point cloud per identity. Then, using generative strategies, we synthesize a very large (potentially infinite) number of samples containing all the desired covariates (poses, clothing, distances, perspectives, lighting, occlusions, ...). Depending on the synthesis method used, the data can be adapted precisely to different kinds of domains, which addresses generalization. Such data are then used to learn a model that performs local registration between image pairs, establishing positive correspondences between body parts. These correspondences are the key not only to recognition (according to their cardinality and distribution), but also to providing an interpretable description of the response (e.g., "both samples are from the same person, as they have similar facial shape, hair color and leg thickness").
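As an illustration of how correspondences between body parts could drive both the decision and its explanation, the following is a minimal sketch and not the method described in the paper. The `Correspondence` structure, the `decide` function, and all thresholds are hypothetical choices made for this example; the registration model that would produce the correspondences is assumed to exist and is not shown.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Correspondence:
    """One positive match between two images (hypothetical structure)."""
    body_part: str  # e.g. "face", "hair", "legs"
    score: float    # match confidence in [0, 1]


def decide(matches, min_score=0.5, min_matches=6, min_parts=3):
    """Toy decision rule based on cardinality and distribution.

    Cardinality: number of correspondences with score >= min_score.
    Distribution: number of distinct body parts those correspondences cover.
    All thresholds are illustrative assumptions, not values from the paper.
    """
    kept = [m for m in matches if m.score >= min_score]
    per_part = Counter(m.body_part for m in kept)
    same = len(kept) >= min_matches and len(per_part) >= min_parts
    if same:
        parts = ", ".join(part for part, _ in per_part.most_common())
        reason = f"same person: {len(kept)} correspondences across {parts}"
    else:
        reason = (
            f"insufficient evidence: {len(kept)} correspondences "
            f"across {len(per_part)} body part(s)"
        )
    return same, reason


if __name__ == "__main__":
    pair = (
        [Correspondence("face", 0.9)] * 3
        + [Correspondence("hair", 0.8)] * 2
        + [Correspondence("legs", 0.7)] * 2
    )
    print(decide(pair))
```

The rule requires matches to be both numerous and spread over several body parts, so many matches concentrated on a single region do not suffice on their own. The body parts that contributed to the decision are reused directly as the textual justification.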
