Sub-bit model compression seeks storage below one bit per weight; as magnitudes are aggressively compressed, the sign bit becomes a fixed-cost bottleneck. Across Transformers, CNNs, and MLPs, learned sign matrices resist low-rank approximation and are spectrally indistinguishable from an i.i.d. Rademacher baseline. Despite this apparent randomness, most weights retain their initialization signs; flips primarily occur via rare near-zero boundary crossings, suggesting that sign-pattern randomness is largely inherited from initialization. We formalize this behavior with sign lock-in theory, a stopping-time analysis of sign flips under SGD noise. Under bounded updates and a rare re-entry condition into a small neighborhood around zero, the number of effective sign flips exhibits a geometric tail. Building on this mechanism, we introduce a gap-based initialization and a lightweight outward-drift regularizer, reducing the effective flip rate to approximately $10^{-3}$ with only about a one-point increase in perplexity.
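The spectral claim above can be checked with a quick experiment. Below is a minimal sketch (not from the paper) that compares the singular-value spectrum of a sign matrix against an i.i.d. Rademacher matrix of the same shape; the 512×512 shape, the stable-rank summary, and the Gaussian stand-in for a trained weight matrix are all illustrative assumptions.

```python
# Illustrative sketch: compare a sign matrix's spectrum to an i.i.d. Rademacher baseline.
# The shape, the stable-rank summary, and the Gaussian placeholder for trained weights
# are assumptions made for illustration only.
import numpy as np

rng = np.random.default_rng(0)

def stable_rank(M):
    """||M||_F^2 / ||M||_2^2 -- close to min(m, n) for near-isotropic matrices."""
    s = np.linalg.svd(M, compute_uv=False)
    return (s ** 2).sum() / (s[0] ** 2)

# Placeholder for a trained weight matrix; in practice this would be a layer's weights.
W = rng.standard_normal((512, 512))
S_learned = np.sign(W)                          # sign matrix under test
S_rad = rng.choice([-1.0, 1.0], size=W.shape)   # i.i.d. Rademacher baseline

print(f"stable rank, learned signs : {stable_rank(S_learned):.1f}")
print(f"stable rank, Rademacher    : {stable_rank(S_rad):.1f}")
# If the two spectra are close, the sign matrix offers little low-rank structure to exploit.
```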
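The abstract does not spell out the gap-based initialization or the outward-drift regularizer; the sketch below shows one plausible reading in PyTorch, where initialization keeps every weight's magnitude outside a band $[-\delta, \delta]$ and a hinge penalty discourages weights from re-entering that band. The function names, the value of `delta`, and the hinge form are assumptions for illustration, not the paper's exact formulation.

```python
# Minimal sketch, assuming one plausible reading of "gap-based initialization"
# (keep |w| >= delta at init) and "outward-drift regularizer" (penalize weights
# that re-enter the band |w| < delta). All forms below are illustrative assumptions.
import torch

def gap_init_(weight: torch.Tensor, std: float = 0.02, delta: float = 0.005) -> torch.Tensor:
    """Initialize as usual, then push every magnitude outside the gap [-delta, delta]."""
    with torch.no_grad():
        weight.normal_(0.0, std)
        sign = torch.where(weight >= 0, torch.ones_like(weight), -torch.ones_like(weight))
        weight.copy_(sign * weight.abs().clamp_min(delta))
    return weight

def outward_drift_penalty(weight: torch.Tensor, delta: float = 0.005) -> torch.Tensor:
    """Hinge penalty that grows as a weight approaches zero, discouraging boundary re-entry."""
    return torch.relu(delta - weight.abs()).sum()

# Usage sketch: add the penalty to the task loss with a small coefficient lam, e.g.
# loss = task_loss + lam * sum(outward_drift_penalty(p) for p in model.parameters())
```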