The attention module in vision transformers (ViTs) computes intricate spatial correlations, contributing significantly to both accuracy and latency. It is therefore important to modulate the number of attention blocks according to input feature complexity for an optimal latency-accuracy tradeoff. To this end, we propose PIVOT, a co-optimization framework that selectively skips attention based on input difficulty. PIVOT employs a hardware-in-the-loop co-search to obtain optimal attention-skip configurations. Evaluations on the ZCU102 MPSoC FPGA show that PIVOT achieves 2.7x lower energy-delay product (EDP) with only a 0.2% accuracy reduction compared to the LVViT-S ViT. PIVOT also achieves 1.3% higher accuracy and 1.8x higher throughput than prior works on traditional CPUs and GPUs. The PIVOT project can be found at https://github.com/Intelligent-Computing-Lab-Yale/PIVOT.
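The core idea of input-dependent attention skipping can be illustrated with a minimal sketch. Below, a toy single-head ViT block runs its attention sublayer only when the input is deemed "difficult"; a per-block boolean skip configuration stands in for the result of PIVOT's hardware-in-the-loop co-search. The `difficulty_score` proxy (token-feature variance) and all function names are illustrative assumptions, not the actual PIVOT implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    # Single-head scaled dot-product attention over tokens x: (T, D).
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = (q @ k.T) / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def vit_block(x, p, skip_attention):
    # Residual attention sublayer is conditionally skipped;
    # the residual MLP sublayer always runs.
    if not skip_attention:
        x = x + self_attention(x, p["wq"], p["wk"], p["wv"])
    x = x + np.maximum(x @ p["w1"], 0.0) @ p["w2"]
    return x

def difficulty_score(x):
    # Hypothetical lightweight proxy for input difficulty:
    # variance of the token features (not PIVOT's actual metric).
    return float(x.var())

def run_with_skipping(x, blocks, skip_config, threshold):
    # skip_config: per-block booleans, assumed to come from an offline
    # co-search; "easy" inputs apply them, "hard" inputs run all blocks.
    easy = difficulty_score(x) < threshold
    for p, can_skip in zip(blocks, skip_config):
        x = vit_block(x, p, skip_attention=(easy and can_skip))
    return x

rng = np.random.default_rng(0)
T, D, H = 8, 16, 32  # tokens, embed dim, MLP hidden dim
def make_block():
    return {"wq": rng.normal(0, 0.1, (D, D)), "wk": rng.normal(0, 0.1, (D, D)),
            "wv": rng.normal(0, 0.1, (D, D)), "w1": rng.normal(0, 0.1, (D, H)),
            "w2": rng.normal(0, 0.1, (H, D))}

blocks = [make_block() for _ in range(4)]
x = rng.normal(size=(T, D))
skip_config = [False, True, True, False]  # skip attention in blocks 1 and 2
y_full = run_with_skipping(x, blocks, skip_config, threshold=0.0)   # "hard" path
y_skip = run_with_skipping(x, blocks, skip_config, threshold=1e9)   # "easy" path
```

The easy path trades some representational power (the skipped attention sublayers) for lower latency, which is the delay-accuracy tradeoff the co-search navigates.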