Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance. While these components yield good accuracy and attractive FLOP counts, the added complexity actually makes these transformers slower than their vanilla ViT counterparts. In this paper, we argue that this additional bulk is unnecessary. By pretraining with a strong visual pretext task, masked autoencoding (MAE), we can strip out all the bells and whistles from a state-of-the-art multi-stage vision transformer without losing accuracy. In the process, we create Hiera, an extremely simple hierarchical vision transformer that is more accurate than previous models while being significantly faster both at inference and during training. We evaluate Hiera on a variety of image and video recognition tasks. Our code and models are available at https://github.com/facebookresearch/hiera.