Recent advances in Transformer architectures [1] have brought remarkable improvements to visual question answering (VQA). However, Transformer-based VQA models are usually deep and wide to guarantee good performance, so they can run only on powerful GPU servers and not on capacity-restricted platforms such as mobile phones. It is therefore desirable to learn an elastic VQA model that supports adaptive pruning at runtime to meet the efficiency constraints of different platforms. To this end, we present the bilaterally slimmable Transformer (BST), a general framework that can be seamlessly integrated into arbitrary Transformer-based VQA models so that a single model is trained once and yields various slimmed submodels of different widths and depths. To verify the effectiveness and generality of this method, we integrate the proposed BST framework with three typical Transformer-based VQA approaches, namely MCAN [2], UNITER [3], and CLIP-ViL [4], and conduct extensive experiments on two commonly used benchmark datasets. In particular, one slimmed MCAN-BST submodel achieves accuracy comparable to that of the reference MCAN model on VQA-v2, while being 0.38x smaller in model size and having 0.27x fewer FLOPs. The smallest MCAN-BST submodel has only 9M parameters and 0.16G FLOPs during inference, making it possible to deploy it on a mobile device with less than 60 ms latency.
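As a rough illustration of why slimming both width and depth shrinks a model so sharply, the following sketch counts the parameters of a plain Transformer encoder stack under different width and depth ratios. It is not the BST training procedure or the MCAN architecture: the `EncoderConfig` shape, the `slim` helper, and the ratios used are hypothetical values chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    """Hypothetical encoder shape; not taken from MCAN or BST."""
    hidden: int = 512   # model width d
    ffn: int = 2048     # feed-forward inner width
    layers: int = 6     # depth L


def slim(cfg: EncoderConfig, width_ratio: float, depth_ratio: float) -> EncoderConfig:
    """Return the submodel shape for the given width and depth ratios."""
    if not (0 < width_ratio <= 1 and 0 < depth_ratio <= 1):
        raise ValueError("ratios must lie in (0, 1]")
    return EncoderConfig(
        hidden=max(1, round(cfg.hidden * width_ratio)),
        ffn=max(1, round(cfg.ffn * width_ratio)),
        layers=max(1, round(cfg.layers * depth_ratio)),
    )


def param_count(cfg: EncoderConfig) -> int:
    """Count weights and biases of a plain Transformer encoder stack."""
    d, f = cfg.hidden, cfg.ffn
    attention = 4 * (d * d + d)               # Q, K, V, and output projections
    feed_forward = (d * f + f) + (f * d + d)  # two linear layers with biases
    layer_norms = 2 * (2 * d)                 # two LayerNorms, scale and shift each
    return cfg.layers * (attention + feed_forward + layer_norms)


full = EncoderConfig()
for w, dr in [(1.0, 1.0), (0.5, 1.0), (1.0, 0.5), (0.5, 0.5)]:
    sub = slim(full, w, dr)
    ratio = param_count(sub) / param_count(full)
    print(f"width={w:.2f} depth={dr:.2f} -> {ratio:.3f}x parameters")
```

Under these assumptions, the parameter count falls roughly quadratically with the width ratio and linearly with the depth ratio, which is why slimming along both dimensions together gives a wider range of model sizes than slimming along either one alone.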