Transformers have demonstrated tremendous success not only in natural language processing (NLP) but also in computer vision, inspiring a variety of creative approaches and applications. Yet the superior performance and modeling flexibility of transformers come with a severe increase in computation cost, and several works have therefore proposed methods to reduce this burden. Inspired by Data Multiplexing (DataMUX), a cost-cutting method originally proposed for language models, we propose a novel approach for efficient visual recognition that employs additional batching along dimension 1 (i.e., concatenation), greatly improving throughput with little compromise in accuracy. We first introduce Image Multiplexer, a naive adaptation of DataMUX for vision models, and then devise novel components to overcome its weaknesses, yielding our final model, ConcatPlexer, which sits at the sweet spot between inference speed and accuracy. ConcatPlexer was trained on the ImageNet1K and CIFAR100 datasets and required 23.5% fewer GFLOPs than ViT-B/16 while achieving 69.5% and 83.4% validation accuracy, respectively.
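
To make the phrase "batching along dimension 1 (i.e., concatenation)" concrete, the following is a minimal sketch of the shape bookkeeping only. It assumes each image has already been turned into a sequence of token embeddings. The function name `concat_dim1` and the parameter `group_size` are hypothetical and do not come from the paper. The sketch does not model any token reduction, the other components of ConcatPlexer, or how per-image predictions are recovered afterwards.

```python
def concat_dim1(tokens, group_size):
    """Group `group_size` consecutive images and concatenate their tokens.

    tokens: nested list with shape (num_images, tokens_per_image, dim).
    Returns a nested list with shape
    (num_images // group_size, group_size * tokens_per_image, dim).
    Illustrative helper only; not the paper's implementation.
    """
    if len(tokens) % group_size != 0:
        raise ValueError("number of images must be divisible by group_size")
    grouped = []
    for start in range(0, len(tokens), group_size):
        sequence = []
        for image_tokens in tokens[start:start + group_size]:
            # Append this image's tokens along the token axis (dim 1).
            sequence.extend(image_tokens)
        grouped.append(sequence)
    return grouped


# Toy input: 4 images, 3 tokens each, embedding dimension 2.
images = [[[float(i), float(t)] for t in range(3)] for i in range(4)]
multiplexed = concat_dim1(images, group_size=2)
shape = (len(multiplexed), len(multiplexed[0]), len(multiplexed[0][0]))
print(shape)
```

In this toy case four single-image sequences become two longer sequences, so the model would process half as many sequences per batch. Because self-attention cost grows with sequence length, such grouping can only pay off if each image contributes fewer tokens than in a standard ViT. How the paper achieves that is not stated in the abstract.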