We present tensor and sequence parallelism (TSP), a parallel execution strategy that folds tensor parallelism and sequence parallelism onto a single device axis. In conventional multi-dimensional parallelism layouts, tensor parallelism (TP) shards model weights while sequence parallelism (SP) shards tokens, reducing per-device parameter and activation memory, respectively. Traditionally, each scheme is assigned its own mesh dimension. TSP instead assigns each rank both a weight shard and a sequence shard, reducing both parameter and activation memory along the same device axis. We implement this design with two runtime schedules. For attention, ranks iterate over broadcast parameter shards and reconstruct context through a sequence-wise key/value exchange. For gated MLPs, weight shards circulate in a ring while partial outputs accumulate locally. By sharding both weights and activations across the same devices, TSP accepts additional communication volume in exchange for a smaller per-device memory footprint. We provide a theoretical communication and memory analysis, describe our implementation of TSP attention and gated MLP blocks, and benchmark TSP against TP, SP, and TP+SP. These results position TSP as a hardware-aware alternative for long-context and memory-constrained model training, and as an axis of parallelism that composes with existing schemes such as pipeline and expert parallelism for dense and mixture-of-experts models.
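To make the gated-MLP schedule concrete, the following is a minimal single-process sketch in NumPy, not the authors' implementation. It assumes a SiLU-gated MLP of the form (silu(x W_gate) * (x W_up)) W_down, weight shards cut along the hidden dimension, and a ring that passes each shard to the next rank once per step. The function names, the choice of SiLU, and the ring direction are illustrative assumptions. Ranks are simulated as list entries, so no real communication takes place.

```python
import numpy as np


def silu(z):
    """SiLU activation, assumed here as the gate nonlinearity."""
    return z / (1.0 + np.exp(-z))


def gated_mlp_reference(x, w_gate, w_up, w_down):
    """Unsharded gated MLP used as the ground truth."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down


def tsp_gated_mlp_simulated(x, w_gate, w_up, w_down, num_ranks):
    """Simulate a ring schedule in which every rank owns one sequence
    shard and one weight shard (hypothetical helper, single process)."""
    # Each rank keeps its own slice of the tokens for the whole computation.
    x_shards = np.array_split(x, num_ranks, axis=0)

    # Weights are cut along the hidden dimension: columns of W_gate and
    # W_up, and the matching rows of W_down.
    gate_shards = np.array_split(w_gate, num_ranks, axis=1)
    up_shards = np.array_split(w_up, num_ranks, axis=1)
    down_shards = np.array_split(w_down, num_ranks, axis=0)

    # held[r] is the weight shard currently resident on rank r.
    held = [
        (gate_shards[r], up_shards[r], down_shards[r])
        for r in range(num_ranks)
    ]

    # Each rank accumulates the output rows for its own tokens.
    outputs = [np.zeros((xs.shape[0], w_down.shape[1])) for xs in x_shards]

    for _ in range(num_ranks):
        for r in range(num_ranks):
            g, u, d = held[r]
            # Partial output from the hidden slice this rank holds right now.
            outputs[r] += (silu(x_shards[r] @ g) * (x_shards[r] @ u)) @ d
        # Ring step: rank r receives the shard previously held by rank r - 1.
        held = [held[(r - 1) % num_ranks] for r in range(num_ranks)]

    # Concatenated only to compare against the reference; in a real run
    # each rank would keep its own rows.
    return np.concatenate(outputs, axis=0)
```

After `num_ranks` ring steps every rank has seen every weight shard exactly once, so the concatenated result should agree with `gated_mlp_reference` up to floating-point error. Because the elementwise gate acts independently on each hidden unit, summing the per-shard partial outputs is exact, and no rank ever holds the full weights or the full sequence.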