Transformers have demonstrated tremendous success not only in natural language processing (NLP) but also in computer vision, inspiring a variety of creative approaches and applications. Yet the superior performance and modeling flexibility of transformers come with a severe increase in computation cost, and several works have proposed methods to reduce this burden. Inspired by Data Multiplexing (DataMUX), a cost-cutting method originally proposed for language models, we propose a novel approach for efficient visual recognition that employs additional dim1 batching (i.e., concatenation), which greatly improves throughput with little compromise in accuracy. We first introduce a naive adaptation of DataMUX for vision models, Image Multiplexer, and then devise novel components to overcome its weaknesses, yielding our final model, ConcatPlexer, which sits at the sweet spot between inference speed and accuracy. Trained on the ImageNet1K and CIFAR100 datasets, ConcatPlexer requires 23.5% fewer GFLOPs than ViT-B/16 while achieving 69.5% and 83.4% validation accuracy, respectively.
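As a rough illustration of what "dim1 batching" means at the tensor level, the sketch below groups every k images in a batch and concatenates their token sequences along dimension 1. It is only a minimal NumPy sketch under stated assumptions: the function name `concat_multiplex`, the grouping factor `k`, and the toy shapes are hypothetical and not taken from the paper, and the paper's actual components (for example, how per-image tokens are reduced or how predictions are separated again) are not reproduced here.

```python
import numpy as np


def concat_multiplex(tokens: np.ndarray, k: int) -> np.ndarray:
    """Concatenate the token sequences of every k consecutive images along dim 1.

    tokens: array of shape (B, N, D) holding N token embeddings of width D
            for each of B images (hypothetical layout, assumed for this sketch).
    returns: array of shape (B // k, k * N, D).
    """
    b, n, d = tokens.shape
    if b % k != 0:
        raise ValueError("batch size must be divisible by k")
    # With C-ordered data, this reshape is equivalent to concatenating the
    # k token sequences of each group along the sequence axis.
    return tokens.reshape(b // k, k * n, d)


# Toy example: 4 images, 49 tokens each, embedding width 8 (arbitrary sizes).
tokens = np.random.default_rng(0).normal(size=(4, 49, 8))
mux = concat_multiplex(tokens, k=2)
print(mux.shape)  # (2, 98, 8)
```

In this toy setting the backbone would see half as many sequences, each twice as long. Any real throughput gain therefore depends on keeping the concatenated sequence short, which this sketch does not address.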