Data pooling offers several advantages, such as increasing the sample size, improving generalization, reducing sampling bias, and mitigating data sparsity and quality issues, but it is not straightforward and may even be counterproductive. Assessing the effectiveness of pooling datasets in a principled manner is challenging because the overall information content of individual datasets is difficult to estimate. To this end, we propose incorporating a data source prediction module into standard object detection pipelines. The module adds minimal overhead at inference time and provides additional information about the data source assigned to each detection. We demonstrate the benefits of the so-called dataset affinity score by automatically selecting samples from a heterogeneous pool of vehicle datasets. The results show that object detectors can be trained on a significantly sparser set of training samples without losing detection accuracy.
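The abstract does not define how the dataset affinity score is computed, so the following is only an illustrative sketch under stated assumptions. It assumes the data source prediction module emits one logit per data source for every detection, takes the softmax probability of a chosen target source as that detection's affinity, averages affinities per image, and keeps the highest-scoring images. The names `Detection`, `image_affinity`, and `select_samples` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from math import exp
from typing import Dict, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


@dataclass
class Detection:
    image_id: str
    # Assumption: one logit per data source, produced by a hypothetical
    # source prediction head attached to the detector.
    source_logits: Sequence[float]


def image_affinity(detections: Sequence[Detection], target_source: int) -> Dict[str, float]:
    """Mean per-detection probability of the target source, per image.

    Assumption: averaging over detections is one plausible aggregation;
    the paper may aggregate differently.
    """
    sums: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for det in detections:
        p = softmax(det.source_logits)[target_source]
        sums[det.image_id] = sums.get(det.image_id, 0.0) + p
        counts[det.image_id] = counts.get(det.image_id, 0) + 1
    return {img: sums[img] / counts[img] for img in sums}


def select_samples(detections: Sequence[Detection], target_source: int, keep: int) -> List[str]:
    """Return the `keep` image ids with the highest affinity to the target source."""
    scores = image_affinity(detections, target_source)
    # Sort by descending score; break ties by image id for determinism.
    ranked = sorted(scores, key=lambda img: (-scores[img], img))
    return ranked[:keep]


if __name__ == "__main__":
    pool = [
        Detection("a", [2.0, 0.0]),
        Detection("a", [1.0, 0.0]),
        Detection("b", [0.0, 3.0]),
        Detection("c", [0.0, 0.0]),
    ]
    print(select_samples(pool, target_source=0, keep=2))
```

In this toy pool, image "a" looks most like source 0 and image "b" least, so a selection budget of two keeps "a" and "c". A real pipeline would take the logits from the trained detector rather than hard-coded values.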