Weight Decay Papers - 专知 (Zhuanzhi)

Weight Decay

Cautious Weight Decay

Arxiv

0+ reads · February 24

Logarithmic-time Schedules for Scaling Language Models with Momentum

Arxiv

0+ reads · February 18

Optimizer choice matters for the emergence of Neural Collapse

Arxiv

0+ reads · February 18

Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Encoder-Only and Decoder-Only Transformers

Arxiv

0+ reads · February 6

Weight Decay may matter more than muP for Learning Rate Transfer in Practice

Arxiv

0+ reads · February 13

Weight Decay Improves Language Model Plasticity

Arxiv

0+ reads · February 11

Provable Emergence of Deep Neural Collapse and Low-Rank Bias in $L^2$-Regularized Nonlinear Networks

Arxiv

0+ reads · February 11

Correctness-Optimized Residual Activation Lens (CORAL): Transferrable and Calibration-Aware Inference-Time Steering

Arxiv

0+ reads · February 5

Logarithmic-time Schedules for Scaling Language Models with Momentum

Arxiv

0+ reads · February 5

Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers

Arxiv

0+ reads · January 8

AlphaDecay: Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs

Arxiv

0+ reads · November 5, 2025

Superposition Yields Robust Neural Scaling

Arxiv

0+ reads · November 29, 2025

Homeostatic Ubiquity of Hebbian Dynamics in Regularized Learning Rules

Arxiv

0+ reads · November 30, 2025

On the Neural Feature Ansatz for Deep Neural Networks

Arxiv

0+ reads · October 17, 2025

Robust Layerwise Scaling Rules by Proper Weight Decay Tuning

Arxiv

0+ reads · October 17, 2025
