The Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models in computer vision. However, ViTs are ill-suited to private inference using secure multi-party computation (MPC) protocols because of their large number of non-polynomial operations (self-attention, feed-forward rectifiers, layer normalization). We propose PriViT, a gradient-based algorithm that selectively "Taylorizes" nonlinearities in ViTs while maintaining their prediction accuracy. Our algorithm is conceptually simple and easy to implement, and it improves on existing approaches to designing MPC-friendly transformer architectures by attaining a better Pareto frontier in the latency-accuracy trade-off. We confirm these improvements via experiments on several standard image classification tasks. Public code is available at https://github.com/NYU-DICE-Lab/privit.
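To make the idea of "Taylorizing" a nonlinearity concrete, the following is a minimal sketch and not the authors' implementation. It assumes, purely for illustration, that a GELU activation is blended with its second-order Taylor polynomial around zero through a scalar gate in [0, 1]; the names `gelu_taylor2`, `gated_activation`, and `count_nonpolynomial`, as well as the 0.5 threshold, are hypothetical. In a gradient-based scheme the gates would be trainable parameters, which this plain-Python sketch does not attempt to train.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_taylor2(x: float) -> float:
    """Second-order Taylor polynomial of GELU around 0: x/2 + x^2/sqrt(2*pi).

    This uses only additions and multiplications, which are the operations
    that MPC protocols evaluate cheaply.
    """
    return 0.5 * x + x * x / math.sqrt(2.0 * math.pi)


def gated_activation(x: float, gate: float) -> float:
    """Blend the exact and polynomial activations (assumed gating form).

    gate = 1.0 keeps the original GELU; gate = 0.0 uses only the polynomial.
    """
    return gate * gelu(x) + (1.0 - gate) * gelu_taylor2(x)


def count_nonpolynomial(gates, threshold: float = 0.5) -> int:
    """Count the units that would still need a non-polynomial op after
    binarizing the gates at the (hypothetical) threshold."""
    return sum(1 for g in gates if g >= threshold)
```

Near zero the polynomial tracks GELU closely, but the gap grows with |x|, which is one reason a selective replacement, rather than a wholesale one, may help preserve accuracy.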