Deploying neural machine translation (NMT) models on mobile devices is essential for privacy, low latency, and offline scenarios. To achieve high capacity, NMT models are rather large, which makes running them on devices challenging given limited storage, memory, computation, and power. Existing work either focuses on a single metric such as FLOPs or relies on a general-purpose engine that is not well suited to auto-regressive decoding. In this paper, we present MobileNMT, a system that can translate within 15MB and 30ms on devices. We propose a series of principles for model compression when combined with quantization. Further, we implement an engine that is friendly to INT8 and decoding. Through the co-design of model and engine, compared with the existing system, we achieve a 47.0x speedup and save 99.5% of memory with only an 11.6% loss of BLEU. The code is publicly available at https://github.com/zjersey/Lightseq-ARM.