Scaling multi-dimensional transformers to long sequences is indispensable across various domains. However, the large memory requirements and slow processing of such sequences necessitate sequence parallelism. All existing approaches fall under the category of embedded sequence parallelism, which is limited to sharding along a single sequence dimension and therefore introduces significant communication overhead. Yet multi-dimensional transformers by nature perform independent calculations across multiple sequence dimensions. To this end, we propose Dynamic Sequence Parallelism (DSP), a novel abstraction of sequence parallelism. DSP dynamically switches the parallel dimension among all sequence dimensions according to the computation stage, using an efficient resharding strategy. DSP offers significant reductions in communication costs, adaptability across modules, and ease of implementation with minimal constraints. Experimental evaluations demonstrate DSP's superiority over state-of-the-art embedded sequence parallelism methods, with throughput improvements ranging from 32.2% to 10x and less than 25% of the communication volume.
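To make the idea of switching the parallel dimension concrete, the following is a minimal single-process sketch, not the authors' implementation. It assumes the resharding step can be modeled as an all-to-all exchange, which the abstract does not spell out. The two-dimensional layout `x[T][S]`, the device count `N`, and the helper names `shard` and `all_to_all_reshard` are hypothetical and chosen only for illustration; devices are simulated as entries of a Python list.

```python
def shard(x, n, dim):
    """Split a 2-D nested list x[T][S] into n equal shards along dim (0 or 1)."""
    t, s = len(x), len(x[0])
    if dim == 0:
        step = t // n
        return [x[i * step:(i + 1) * step] for i in range(n)]
    step = s // n
    return [[row[i * step:(i + 1) * step] for row in x] for i in range(n)]


def all_to_all_reshard(shards):
    """Switch from dim-0 sharding to dim-1 sharding via a simulated all-to-all.

    Assumption: each simulated device holds a contiguous block of rows and
    the column count is divisible by the number of devices.
    """
    n = len(shards)
    step = len(shards[0][0]) // n
    # send[i][j] is the chunk device i sends to device j: column block j
    # of device i's local rows.
    send = [
        [[row[j * step:(j + 1) * step] for row in shards[i]] for j in range(n)]
        for i in range(n)
    ]
    # Device j stacks the chunks received from devices 0..n-1 along dim 0,
    # so it ends up with every row but only its own column block.
    return [[row for i in range(n) for row in send[i][j]] for j in range(n)]


T, S, N = 4, 6, 2  # hypothetical sizes: two sequence dimensions, two devices
x = [[t * S + s for s in range(S)] for t in range(T)]

by_t = shard(x, N, dim=0)        # stage that computes along S: shard along T
by_s = all_to_all_reshard(by_t)  # next stage computes along T: shard along S

# The resharded layout matches sharding the full tensor along S directly.
assert by_s == shard(x, N, dim=1)
```

In this sketch each device always holds the full extent of the dimension being computed over, so the computation within a stage needs no communication; the only exchange happens when the stage, and with it the sharded dimension, changes.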