Transformers were initially introduced for natural language processing (NLP) tasks, but they were quickly adopted by most deep learning fields, including computer vision. They measure the relationships between pairs of input tokens (words in the case of text strings, parts of images for visual Transformers), termed attention. Because every token is compared with every other token, the cost grows quadratically with the number of tokens. For image classification, the most common Transformer architecture uses only the Transformer encoder to transform the input tokens. However, there are also numerous other applications in which the decoder part of the traditional Transformer architecture is used as well. Here, we first introduce the attention mechanism (Section 1), and then the basic Transformer block, including the Vision Transformer (Section 2). Next, we discuss some improvements of visual Transformers to account for small datasets or reduced computation (Section 3). Finally, we introduce visual Transformers applied to tasks other than image classification, such as detection, segmentation, generation, and training without labels (Section 4), and to other domains, such as video or multimodality using text or audio data (Section 5).
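To make the pairwise nature of attention concrete, the following is a minimal sketch of scaled dot-product self-attention in plain Python. It rests on simplifying assumptions that are not taken from the text: the tokens serve directly as queries, keys, and values (a real Transformer would first apply learned projections), there is a single head, and the names `softmax`, `self_attention`, and `tokens` are purely illustrative. The point is only that one score is computed per ordered pair of tokens, so the weight matrix has as many entries as the square of the token count.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption for illustration: queries, keys, and values are the
    tokens themselves (no learned projection matrices).
    """
    d = len(tokens[0])
    # One score per ordered pair (query token, key token).
    scores = [
        [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        for q in tokens
    ]
    # Each row of weights sums to 1.
    weights = [softmax(row) for row in scores]
    # Each output is a weighted average of all value vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, tokens)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights


random.seed(0)
n_tokens, dim = 16, 8  # arbitrary sizes chosen for the example
tokens = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_tokens)]
outputs, weights = self_attention(tokens)

# Number of pairwise attention weights: n_tokens * n_tokens.
n_pairs = sum(len(row) for row in weights)
```

Doubling `n_tokens` in this sketch quadruples `n_pairs`, which is the quadratic growth in cost that motivates the efficiency-oriented variants discussed later.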