In the era of big data, collecting high-quality data is essential. Manual collection, however, is both time-consuming and expensive, so researchers have devised various methods for collecting data automatically. Web crawling is one such method, but the authors found that it collects unintended data along with the data the user is looking for. They show that much of this noise can be filtered out with the object detection model YOLOv10, although some improperly filtered data still remains. To handle these cases, images are reclassified using the distance output by a Siamese network, which achieved higher performance than other classification models (average F1 score: YOLO+MobileNet 0.678 → YOLO+SiameseNet 0.772). The user can specify a distance threshold to adjust the balance between data deficiency and noise robustness. The authors also found that the Siamese network achieves higher performance with fewer resources when it processes the cropped images produced by object detection rather than the full images (mean F1 score over 20 classes: non-crop+Siamese(MobileNetV3-Small) 80.94 → crop preprocessing+Siamese(MobileNetV3-Small) 82.31). An image retrieval system that chains two models in this way to reduce errors can save users time and effort and build higher-quality datasets faster and with fewer resources than before.
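The distance-threshold step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the Siamese network has already mapped each cropped detection and each labelled reference image to an embedding vector, that the distance is Euclidean, and that a candidate takes the label of its nearest reference. The names `euclidean`, `reclassify`, and `filter_crawled` are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two embeddings (assumed metric)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def reclassify(
    embedding: Vector,
    references: Dict[str, List[Vector]],
    threshold: float,
) -> Tuple[Optional[str], float]:
    """Return (label, distance) for the nearest reference embedding.

    The label is None when even the nearest reference is farther away
    than `threshold`, i.e. the image is treated as noise.
    """
    best_label: Optional[str] = None
    best_dist = math.inf
    for label, refs in references.items():
        for ref in refs:
            dist = euclidean(embedding, ref)
            if dist < best_dist:
                best_label, best_dist = label, dist
    if best_dist > threshold:
        return None, best_dist
    return best_label, best_dist


def filter_crawled(
    candidates: List[Tuple[str, Vector]],
    references: Dict[str, List[Vector]],
    threshold: float,
) -> List[Tuple[str, str]]:
    """Keep only the candidates whose nearest reference is within the threshold."""
    kept = []
    for name, emb in candidates:
        label, _ = reclassify(emb, references, threshold)
        if label is not None:
            kept.append((name, label))
    return kept


# Toy 2-D embeddings, purely illustrative.
references = {"cat": [[0.0, 0.0]], "dog": [[1.0, 1.0]]}
candidates = [
    ("a.jpg", [0.1, 0.0]),
    ("b.jpg", [0.9, 1.0]),
    ("c.jpg", [5.0, 5.0]),
]
strict = filter_crawled(candidates, references, threshold=0.5)
loose = filter_crawled(candidates, references, threshold=10.0)
```

In this toy setting the strict threshold discards `c.jpg` as noise, while the loose threshold keeps it under the nearest label. This mirrors the trade-off the abstract describes: a smaller threshold yields cleaner but fewer images, and a larger one yields more images with more noise.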