This work draws attention to the large fraction of near-duplicates shared between the training and test sets of datasets widely adopted in License Plate Recognition (LPR) research. These near-duplicates are images that, although different, show the same license plate. Our experiments, conducted on the two most popular datasets in the field, show a substantial decrease in recognition rate when six well-known models are trained and tested under fair splits, that is, splits in which no license plate appears in both the training and test sets. Moreover, in one of the datasets, the ranking of models changed considerably when they were trained and tested under duplicate-free splits. These findings suggest that such duplicates have significantly biased the evaluation and development of deep learning-based models for LPR. The list of near-duplicates we have found and proposals for fair splits are publicly available for further research at https://raysonlaroca.github.io/supp/lpr-train-on-test/
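To make the notion of a fair split concrete, the following is a minimal sketch, not the procedure used in this work. It assumes each sample is an `(image_path, plate_text)` pair and treats two images as near-duplicates when their plate text is identical. The function names `find_shared_plates` and `fair_split` are hypothetical and introduced only for illustration.

```python
import random
from collections import defaultdict


def find_shared_plates(train, test):
    """Return plate strings that occur in both splits (candidate near-duplicates)."""
    train_plates = {plate for _, plate in train}
    test_plates = {plate for _, plate in test}
    return train_plates & test_plates


def fair_split(samples, test_fraction=0.2, seed=0):
    """Split (image_path, plate_text) pairs so that no plate text spans both splits.

    Assumption: identical plate text is used as the near-duplicate criterion.
    """
    # Group every image under the plate it shows.
    groups = defaultdict(list)
    for path, plate in samples:
        groups[plate].append((path, plate))

    # Sort before shuffling so the result depends only on the seed.
    plates = sorted(groups)
    random.Random(seed).shuffle(plates)

    target = test_fraction * len(samples)
    train, test = [], []
    for plate in plates:
        # A whole group goes to one side, so a plate never straddles the split.
        bucket = test if len(test) < target else train
        bucket.extend(groups[plate])
    return train, test
```

Because whole groups are assigned at once, the realized test fraction only approximates `test_fraction`. Running `find_shared_plates` on an existing train/test protocol is a quick way to check whether it contains plates seen during training.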