Padding is commonly used when fine-tuning LLMs: special pad tokens are appended to shorter training examples so they match the length of the longest sequence in each batch. While this ensures uniform shapes for batch processing, the padding tokens carry no information yet still enter the computation, wasting GPU resources. The Hugging Face SFT trainer instead offers packing, which concatenates multiple training examples into a single sequence up to the maximum sequence length, allowing much fuller utilization of the GPU. However, without proper masking of each packed example, attention is computed incorrectly across example boundaries when using the SFT trainer. We enable packing and Flash Attention with proper per-example attention masking and analyse the benefits of this training paradigm.
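To make the masking issue concrete, the sketch below packs tokenized examples greedily and builds a block-diagonal causal mask so no token attends across example boundaries. The function names (`pack_examples`, `block_diagonal_causal_mask`) are illustrative, not part of the SFT trainer's API; in practice Flash Attention's variable-length kernels use cumulative sequence lengths rather than a dense mask, but the attention pattern enforced is the same.

```python
import numpy as np

def pack_examples(examples, max_len):
    """Greedily pack tokenized examples into sequences of up to max_len tokens.

    Returns (packed_ids, position_ids); position ids restart at 0 for each
    example, which lets us recover example boundaries after packing."""
    packed, positions = [], []
    cur_ids, cur_pos = [], []
    for ex in examples:
        if cur_ids and len(cur_ids) + len(ex) > max_len:
            packed.append(cur_ids)
            positions.append(cur_pos)
            cur_ids, cur_pos = [], []
        cur_ids.extend(ex)
        cur_pos.extend(range(len(ex)))
    if cur_ids:
        packed.append(cur_ids)
        positions.append(cur_pos)
    return packed, positions

def block_diagonal_causal_mask(position_ids):
    """Token i may attend to token j only if j <= i (causal) and both tokens
    belong to the same packed example (detected by a position reset to 0)."""
    n = len(position_ids)
    # segment id increments whenever positions restart at 0
    seg = np.cumsum([1 if p == 0 else 0 for p in position_ids])
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1):
            mask[i, j] = seg[i] == seg[j]
    return mask

# Three examples packed into sequences of at most 6 tokens.
packed, positions = pack_examples([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_len=6)
# packed    -> [[1, 2, 3, 4, 5], [6, 7, 8, 9]]
# positions -> [[0, 1, 2, 0, 1], [0, 1, 2, 3]]
mask = block_diagonal_causal_mask(positions[0])
# Within the first example, token 2 sees token 0; across the boundary,
# token 3 (start of the second example) does not see token 2.
```

Without the block-diagonal structure, a naive causal mask would let the second example's tokens attend to the first example's tokens, which is exactly the incorrect attention the abstract describes.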