The Transformer architecture has come to dominate the field of large language models (LLMs). Recent advances in training optimization for Transformer-based LLMs focus primarily on architectural modifications or optimizer adjustments. However, these approaches do not systematically optimize weight patterns during training, where a weight pattern refers to the distribution and relative magnitudes of the weight parameters in a neural network. To address this gap, we propose WISCA, a weight scaling method that improves training efficiency and model quality by strategically improving neural network weight patterns without changing the network structure. By rescaling weights while preserving model outputs, WISCA indirectly optimizes the model's training trajectory. Experiments demonstrate that WISCA significantly improves convergence quality (measured by generalization capability and loss reduction), particularly for LLMs with Grouped Query Attention (GQA) architectures and for LoRA fine-tuning tasks. Empirical results show a 5.6% average improvement on zero-shot validation tasks and a 2.12% average reduction in training perplexity across multiple architectures.
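To make the idea of rescaling weights while preserving model outputs concrete, the following is a minimal sketch of a generic output-preserving rescaling. It is not WISCA's actual procedure, which the abstract does not specify. It assumes a bias-free two-layer ReLU network, where multiplying the first layer by s > 0 and dividing the second by s leaves the output unchanged because ReLU is positively homogeneous. All function names, and the choice of s that equalizes the two Frobenius norms, are hypothetical and used for illustration only.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def relu(v):
    return [max(0.0, a) for a in v]


def mlp(W1, W2, x):
    """Bias-free two-layer ReLU network: W2 @ relu(W1 @ x)."""
    return matvec(W2, relu(matvec(W1, x)))


def frobenius(W):
    return math.sqrt(sum(w * w for row in W for w in row))


def rescale(W1, W2, s):
    """Return (s * W1, W2 / s). For s > 0 the network output is unchanged,
    since relu(s * z) == s * relu(z)."""
    if s <= 0:
        raise ValueError("s must be positive")
    W1_scaled = [[s * w for w in row] for row in W1]
    W2_scaled = [[w / s for w in row] for row in W2]
    return W1_scaled, W2_scaled


def balancing_scale(W1, W2):
    """Hypothetical choice of s (an assumption, not taken from the paper):
    the value that makes the two Frobenius norms equal after rescaling."""
    return math.sqrt(frobenius(W2) / frobenius(W1))


def random_matrix(rows, cols, std, rng):
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    W1 = random_matrix(8, 4, 0.1, rng)   # small-magnitude layer
    W2 = random_matrix(3, 8, 5.0, rng)   # large-magnitude layer
    x = [rng.gauss(0.0, 1.0) for _ in range(4)]

    s = balancing_scale(W1, W2)
    W1b, W2b = rescale(W1, W2, s)

    print("norms before:", frobenius(W1), frobenius(W2))
    print("norms after: ", frobenius(W1b), frobenius(W2b))
    print("output before:", mlp(W1, W2, x))
    print("output after: ", mlp(W1b, W2b, x))
```

The two parameterizations compute the same function up to floating-point rounding, but their weight patterns differ, so gradient-based updates applied to each would generally follow different trajectories. This is the general sense in which an output-preserving rescaling can influence training without changing the network structure.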