Word2Vec remains one of the most impactful innovations in Natural Language Processing (NLP): it represents latent grammatical and syntactical information in human text with dense, low-dimensional vectors. Word2Vec has a high computational cost due to the algorithm's inherent sequentiality, intensive memory accesses, and the large vocabularies it represents. While prior studies have investigated techniques to exploit parallelism and improve memory system performance, they struggle to gain throughput effectively on powerful GPUs. We identify memory data access and latency as the primary bottleneck in prior GPU work, which prevents highly optimized kernels from attaining the architecture's peak performance. We present a novel algorithm, FULL-W2V, which maximally exploits the opportunities for data reuse in the W2V algorithm and leverages GPU architecture and resources to reduce accesses to the lower levels of the memory hierarchy and improve temporal locality. FULL-W2V reduces accesses to GPU global memory significantly, e.g., by more than 89%, compared to prior state-of-the-art GPU implementations, resulting in a significant performance improvement that scales across successive hardware generations. Our prototype implementation achieves a 2.97X speedup when ported from Nvidia Pascal P100 to Volta V100 cards, and outperforms the state of the art by 5.72X on V100 cards with the same embedding quality. In-depth analysis indicates that reducing memory accesses through register and shared-memory caching, together with high-throughput shared-memory reduction, leads to significantly improved arithmetic intensity. FULL-W2V can potentially benefit many applications in NLP and other domains.