Scaled dot-product attention applies a softmax function to the scaled dot product of queries and keys to compute attention weights, and then multiplies the weights by the values. In this work, we study how to improve the learning of scaled dot-product attention in order to raise the accuracy of DETR. Our method is based on two observations: (i) using the ground-truth foreground-background mask (GT Fg-Bg Mask) as an additional cue when learning the weights or the values yields much better weights or values, respectively; and (ii) with better weights, better values can be learned, and vice versa. We propose a triple-attention module in which the first attention is a plain scaled dot-product attention, the second attention generates high-quality weights with the assistance of the GT Fg-Bg Mask and shares values with the first attention, and the third attention generates high-quality values with the same assistance and shares weights with the first attention. This sharing improves the quality of the first attention's values and weights, respectively. The second and third attentions are removed during inference. We call our method knowledge-sharing DETR (KS-DETR). It extends knowledge distillation (KD) in that the improved weights and values of the teachers (the second and third attentions) are shared directly with the student (the first attention), rather than mimicked by it, which enables more efficient knowledge transfer from the teachers to the student. Experiments on various DETR-like methods show consistent improvements over the baseline methods on the MS COCO benchmark. Code is available at https://github.com/edocanonymous/KS-DETR.
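The sharing pattern described above can be illustrated with a minimal pure-Python sketch. It is not the released KS-DETR implementation. The function names (`attention_weights`, `apply_weights`, `triple_attention`) and the teacher inputs `q_t`, `k_t`, `v_t` are hypothetical. In particular, how the GT Fg-Bg Mask is fused into the teacher inputs is not specified in this abstract, so the sketch simply takes mask-aided teacher queries, keys, and values as given.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(q, k):
    """Row-wise softmax of Q K^T / sqrt(d), with q and k as lists of rows."""
    scale = 1.0 / math.sqrt(len(q[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k])
        for q_row in q
    ]


def apply_weights(w, v):
    """Multiply attention weights by values: each output row is a weighted sum of value rows."""
    return [
        [sum(w_row[j] * v[j][c] for j in range(len(v))) for c in range(len(v[0]))]
        for w_row in w
    ]


def triple_attention(q, k, v, q_t, k_t, v_t, training=True):
    """Hypothetical sketch of the triple-attention sharing pattern.

    q, k, v       : student (plain attention) queries, keys, values
    q_t, k_t, v_t : assumed mask-aided teacher queries, keys, values
    """
    # First attention (student): plain scaled dot-product attention.
    w_student = attention_weights(q, k)
    out_student = apply_weights(w_student, v)
    if not training:
        # The second and third attentions are removed during inference.
        return out_student, None, None

    # Second attention (teacher): its own weights, values shared with the student.
    w_teacher = attention_weights(q_t, k_t)
    out_teacher_weights = apply_weights(w_teacher, v)

    # Third attention (teacher): its own values, weights shared with the student.
    out_teacher_values = apply_weights(w_student, v_t)

    return out_student, out_teacher_weights, out_teacher_values
```

Because each teacher branch reuses one of the student's two components, a training loss applied to a teacher output also sends gradients through the shared student component. This is one way to read "shared, instead of mimicked". The loss terms themselves are outside the scope of this sketch.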