Tracking Persons-of-Interest via Unsupervised Representation Adaptation

Multi-face tracking in unconstrained videos is a challenging problem because the faces of one person often appear drastically different across shots due to significant variations in scale, pose, expression, illumination, and make-up. Existing multi-target tracking methods often use low-level features that are not sufficiently discriminative for identifying faces with such large appearance variations. In this paper, we tackle this problem by learning discriminative, video-specific face representations using convolutional neural networks (CNNs). Unlike existing CNN-based approaches, which are trained only offline on large-scale face image datasets, we use contextual constraints to generate a large number of training samples for a given video and then adapt the pre-trained face CNN to that video using the discovered samples. Using these training samples, we optimize the embedding space by minimizing a triplet loss function so that Euclidean distances correspond to a measure of semantic face similarity. With the learned discriminative features, we apply a hierarchical clustering algorithm to link tracklets across multiple shots into trajectories. We extensively evaluate the proposed algorithm on two sets of TV sitcoms and YouTube music videos, analyze the contribution of each component, and demonstrate significant performance improvement over existing techniques.
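
To make the triplet loss mentioned above concrete, the following is a minimal sketch of the standard hinge-style formulation on squared Euclidean distances. It is an illustration under stated assumptions, not the authors' implementation: the function names, the margin value of 0.2, and the toy two-dimensional embeddings are all hypothetical, and the abstract does not specify the exact variant of the loss that is used.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard hinge-style triplet loss (margin is an assumed value).

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin` in squared Euclidean distance.
    """
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy embeddings (hypothetical): anchor and positive stand for two faces
# of the same person, negative for a face of a different person.
anchor = [0.0, 0.0]
positive = [0.1, 0.0]
far_negative = [1.0, 0.0]    # already well separated from the anchor
near_negative = [0.2, 0.0]   # violates the margin

satisfied = triplet_loss(anchor, positive, far_negative)
violated = triplet_loss(anchor, positive, near_negative)
```

In this sketch a well-separated triplet contributes no loss, while a triplet whose negative lies too close to the anchor contributes a positive loss, which is what pushes the embedding toward distances that reflect face identity.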
