We consider a scenario in which we have access to the target domain but cannot afford on-the-fly annotation of training data, and instead wish to construct an alternative training set from a large-scale data pool such that a competitive model can be obtained. We propose a search and pruning (SnP) solution to this training data search problem, tailored to object re-identification (re-ID), an application that aims to match the same object captured by different cameras. Specifically, the search stage (Stage I) identifies and merges clusters of source identities whose distributions are similar to that of the target domain. The pruning stage (Stage II) then selects, subject to a budget, identities and their images from the Stage I output, controlling the size of the resulting training set for efficient training. Together, the two stages yield training sets 80% smaller than the source pool while achieving similar or even higher re-ID accuracy. Under the same budget on training data size, these training sets also outperform those produced by existing search methods such as random sampling and greedy sampling. If we lift the budget, training sets produced by the first stage alone allow even higher re-ID accuracy. We discuss the specificity of our method to the re-ID problem and, in particular, its role in bridging the re-ID domain gap. The code is available at https://github.com/yorkeyao/SnP.
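To make the two-stage structure concrete, the following is a minimal sketch under simplifying assumptions, not the released implementation. The function name `search_and_prune`, its parameters, the use of plain k-means over per-identity mean features, the Euclidean distance between feature means as the measure of distribution similarity, and the nearest-to-target pruning rule are all hypothetical choices made for illustration; the repository linked above remains the authoritative reference for the actual method.

```python
import numpy as np


def search_and_prune(src_feats, src_ids, tgt_feats,
                     n_clusters=4, budget_ids=10, seed=0):
    """Illustrative two-stage sketch: search (cluster + merge), then prune.

    All design choices here are assumptions, not the authors' method.
    """
    rng = np.random.default_rng(seed)
    ids = np.unique(src_ids)
    # Summarize each source identity by the mean of its image features.
    id_means = np.stack([src_feats[src_ids == i].mean(axis=0) for i in ids])
    tgt_mean = tgt_feats.mean(axis=0)

    # Stage I (search), part 1: cluster source identities with plain k-means.
    centers = id_means[rng.choice(len(ids), n_clusters, replace=False)]
    for _ in range(20):
        dists = np.linalg.norm(id_means[:, None] - centers[None], axis=2)
        assign = dists.argmin(axis=1)
        for k in range(n_clusters):
            if np.any(assign == k):
                centers[k] = id_means[assign == k].mean(axis=0)
    # Recompute assignments so they match the final centers.
    dists = np.linalg.norm(id_means[:, None] - centers[None], axis=2)
    assign = dists.argmin(axis=1)

    # Stage I (search), part 2: merge clusters, closest to the target first,
    # for as long as the merged set moves closer to the target mean.
    order = np.argsort(np.linalg.norm(centers - tgt_mean, axis=1))
    selected = np.zeros(len(ids), dtype=bool)
    best_gap = np.inf
    for k in order:
        members = assign == k
        if not members.any():
            continue  # skip empty clusters
        trial = selected | members
        gap = np.linalg.norm(id_means[trial].mean(axis=0) - tgt_mean)
        if gap >= best_gap:
            break
        selected, best_gap = trial, gap

    # Stage II (pruning): keep at most `budget_ids` identities, here the
    # ones whose mean feature lies closest to the target mean.
    cand = np.flatnonzero(selected)
    d = np.linalg.norm(id_means[cand] - tgt_mean, axis=1)
    keep = cand[np.argsort(d)[:budget_ids]]
    return ids[keep]
```

Calling the function with image-level source features, their identity labels, and unlabeled target features returns the identity labels retained under the budget; setting `budget_ids` to the number of source identities corresponds to lifting the budget and using the Stage I output alone.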