Towards Zero-Shot Interpretable Human Recognition: A 2D-3D Registration Framework

Large vision models based on deep learning architectures have consistently advanced the state of the art in biometric recognition. However, three weaknesses are commonly reported for such approaches: 1) their extreme demands in terms of learning data; 2) the difficulty of generalising between different domains; and 3) the lack of interpretability/explainability, which is of particular concern in biometrics, where it is important to provide evidence that can be used for forensic/legal purposes (e.g., in courts). To the best of our knowledge, this paper describes the first recognition framework/strategy that aims to address the three weaknesses simultaneously. First, it relies exclusively on synthetic samples for learning purposes. Instead of requiring a large number and variety of samples for each subject, the idea is to enroll only a single 3D point cloud per identity. Then, using generative strategies, we synthesize a very large (potentially infinite) number of samples containing all the desired covariates (poses, clothing, distances, perspectives, lighting, occlusions, ...). Depending on the synthesis method used, it is possible to adapt precisely to different kinds of domains, which addresses generalization. Such data are then used to learn a model that performs local registration between image pairs, establishing positive correspondences between body parts. These correspondences are the key not only to recognition (according to their cardinality and distribution), but also to providing an interpretable description of the response (e.g., "both samples are from the same person, as they have similar facial shape, hair color and leg thickness").
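As a rough illustration of how a decision based on the cardinality and distribution of correspondences might also yield a readable explanation, consider the minimal sketch below. It is not the method of the paper: the `Correspondence` type, the `decide` function, the body-part labels, the scores and all thresholds are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Correspondence:
    part: str       # hypothetical body-part label, e.g. "facial"
    attribute: str  # hypothetical attribute, e.g. "shape"
    score: float    # hypothetical match confidence in [0, 1]


def decide(correspondences, min_score=0.5, min_count=3, min_parts=2):
    """Toy decision rule (assumed, not from the paper).

    Cardinality: at least `min_count` positive correspondences.
    Distribution: they must cover at least `min_parts` distinct body parts.
    """
    positives = [c for c in correspondences if c.score >= min_score]
    parts = {c.part for c in positives}
    same = len(positives) >= min_count and len(parts) >= min_parts
    if same:
        evidence = ", ".join(f"{c.part} {c.attribute}" for c in positives)
        explanation = f"same person: similar {evidence}"
    else:
        explanation = "too few consistent correspondences to claim the same person"
    return same, explanation


# Hypothetical correspondences for one image pair.
matches = [
    Correspondence("facial", "shape", 0.91),
    Correspondence("hair", "color", 0.84),
    Correspondence("leg", "thickness", 0.77),
    Correspondence("torso", "width", 0.20),  # below threshold, ignored
]
same, explanation = decide(matches)
print(same, explanation)
```

In this toy rule the explanation is assembled directly from the correspondences that supported the decision, which is the property the framework relies on for interpretability; how correspondences are actually scored and aggregated is left unspecified here.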
