Data leakage is a common but often overlooked problem that arises when splitting data into train and test sets before training an ML/DL model. In the presence of data leakage, model performance is artificially inflated during the evaluation phase, which often leads the model to erroneous predictions when deployed in real time. However, detecting such leakage is challenging, particularly in the object detection context of perception systems, where the model must be trained on image data. In this study, we conduct a computational experiment on the Cirrus dataset from our industrial partner Volvo Cars to develop a method for detecting data leakage. We then evaluate the method on the public Kitti dataset, a popular and widely accepted benchmark in the automotive domain. The results show that our proposed method detects data leakage in the Kitti dataset that was previously unknown.