Bipartite Mode Matching for Vision Training Set Search from a Hierarchical Data Server

We study a setting in which the target domain is accessible, but real-time data annotation is not feasible. Instead, we aim to construct an alternative training set from a large-scale data server so that a competitive model can still be obtained. Because the target domain usually exhibits distinct modes (i.e., semantic clusters that characterize the data distribution), model performance is compromised if the training set does not contain these target modes. While prior work improves algorithms iteratively, our research explores the often-overlooked potential of optimizing the structure of the data server. Inspired by the hierarchical nature of web search engines, we introduce a hierarchical data server, together with a bipartite mode matching (BMM) algorithm that aligns source and target modes. For each target mode, we search the server data tree for the best-matching mode, which may be large or small in size. Through bipartite matching, we aim for all target modes to be optimally matched with source modes in a one-to-one fashion. Compared with existing training set search algorithms, we show that the matched server modes constitute training sets that have consistently smaller domain gaps with the target domain across object re-identification (re-ID) and detection tasks. Consequently, models trained on our searched training sets achieve higher accuracy than those trained on sets found by existing search algorithms. BMM enables data-centric unsupervised domain adaptation (UDA) that is orthogonal to existing model-centric UDA methods. Combining BMM with existing UDA methods such as pseudo-labeling yields further improvement.
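The one-to-one matching between target modes and server modes described above can be illustrated with a small sketch. It is not the authors' implementation: the tree layout (`name`, `centroid`, `children`), the helper names, the use of squared Euclidean distance between mode centroids as the matching cost, and the toy numbers are all assumptions made for illustration. The abstract does not specify the cost or the solver. The assignment is found by exhaustive search to keep the example dependency-free; at realistic scale a Hungarian-style solver such as `scipy.optimize.linear_sum_assignment` would be the natural substitute.

```python
from itertools import permutations


def sq_dist(a, b):
    """Squared Euclidean distance between two centroids (assumed cost)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flatten_tree(node, out=None):
    """Collect (name, centroid) for every node of the hypothetical server tree,
    so that both large (shallow) and small (deep) modes are candidates."""
    if out is None:
        out = []
    out.append((node["name"], node["centroid"]))
    for child in node.get("children", []):
        flatten_tree(child, out)
    return out


def match_modes(target_centroids, server_tree):
    """Assign each target mode to a distinct server mode, minimizing total cost.

    Exhaustive search over injective assignments; only suitable for toy sizes.
    """
    candidates = flatten_tree(server_tree)
    if len(candidates) < len(target_centroids):
        raise ValueError("need at least as many server modes as target modes")
    cost = [[sq_dist(t, c) for _, c in candidates] for t in target_centroids]
    best_total, best_assign = float("inf"), None
    for perm in permutations(range(len(candidates)), len(target_centroids)):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if total < best_total:
            best_total, best_assign = total, perm
    return [candidates[j][0] for j in best_assign], best_total


# Toy server tree with made-up 2-D centroids (illustrative only).
tree = {
    "name": "root", "centroid": (0.0, 0.0), "children": [
        {"name": "vehicles", "centroid": (1.0, 0.0), "children": [
            {"name": "cars", "centroid": (1.0, 1.0)},
            {"name": "trucks", "centroid": (2.0, 0.0)},
        ]},
        {"name": "people", "centroid": (-1.0, 0.0)},
    ],
}
targets = [(1.0, 0.2), (1.1, 0.0)]  # two hypothetical target-mode centroids

matched, total = match_modes(targets, tree)
print(matched, total)
```

In this toy case both target modes are individually closest to the same server node, so matching each one independently would collide. The joint assignment resolves the conflict by minimizing the summed cost. The sketch does not prevent a node and one of its ancestors from both being selected; how overlapping modes are handled is not stated in the abstract and is left out here.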
