Transformer-based object detectors such as DETR have shown strong performance across machine vision tasks, particularly in object detection. DETR relies on a self-attention mechanism within a transformer encoder-decoder architecture to capture the global context of an image. A critical open question is how this architecture handles image nuisances such as occlusion and adversarial perturbations. We studied this question by measuring the performance of DETR in a series of experiments and benchmarking it against convolutional neural network (CNN) based detectors such as YOLO and Faster-RCNN. We found that DETR is resistant to the information loss caused by occlusion. However, adversarial stickers placed on the image force the network to produce a new, unnecessary set of keys, queries, and values, which in most cases misdirects the network. DETR also performed worse than YOLOv5 on the image corruption benchmark. Furthermore, we found that DETR depends heavily on the main query when making a prediction, which leads to imbalanced contributions between queries because the main query receives most of the gradient flow.
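For readers unfamiliar with the keys, queries, and values mentioned above, the following is a minimal NumPy sketch of single-head scaled dot-product self-attention. It is not DETR's implementation: the function name `self_attention`, the random projection matrices, the token counts, and the zeroing of tokens as a stand-in for occlusion are all assumptions made for illustration only.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (illustrative only).

    x: (n_tokens, d_model) token features, e.g. flattened image patches.
    w_q, w_k, w_v: (d_model, d_head) projection matrices (hypothetical here).
    Returns the attended output and the (n_tokens, n_tokens) attention weights.
    """
    q = x @ w_q  # queries
    k = x @ w_k  # keys
    v = x @ w_v  # values
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Subtract the row-wise maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 16, 8, 8  # arbitrary sizes for the sketch
x = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)

# Crude stand-in for occlusion: zero out the features of tokens 4..7.
x_occ = x.copy()
x_occ[4:8] = 0.0
out_occ, weights_occ = self_attention(x_occ, w_q, w_k, w_v)
```

Every row of `weights` is a distribution over all tokens, which is the sense in which self-attention sees global context. In this toy setup a zeroed token has a zero query, so it attends uniformly to every other token; how a trained detector behaves under real occlusion is an empirical question that this sketch does not answer.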