Large language models have revolutionized AI applications, yet their high computational and memory demands hinder widespread deployment. Existing compression techniques focus on intra-block optimizations (e.g., low-rank approximation or attention pruning), while the repetitive layered structure of transformers implies significant inter-block redundancy, a dimension largely unexplored beyond key-value (KV) caching. Inspired by dictionary learning in convolutional networks, we propose a framework for structured weight sharing across transformer layers. Our approach decomposes the attention projection matrices (Q, K, V, O) into shared dictionary atoms, reducing the attention module's parameters by 66.7\% while achieving on-par performance. Unlike complex methods that require distillation or architectural changes, MASA (Matrix Atom Sharing in Attention) operates as a drop-in replacement, trained with standard optimizers, and represents each layer's weights as linear combinations of shared matrix atoms. Experiments across scales (100M-700M parameters) show that MASA achieves better benchmark accuracy and perplexity than GQA, low-rank baselines, and recent Repeat-all-over/Sequential sharing at comparable parameter budgets. Ablation studies confirm robustness to dictionary size and the efficacy of shared representations in capturing cross-layer statistical regularities. Extending to Vision Transformers (ViT), MASA matches performance on image classification tasks with 66.7\% fewer attention parameters. By combining dictionary learning strategies with transformer efficiency, MASA offers a scalable blueprint for parameter-efficient models without sacrificing performance. Finally, we investigate applying MASA to large pretrained models to reduce their parameter count without a significant drop in performance.
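To make the sharing scheme concrete, a minimal sketch of the decomposition follows; the dictionary size $M$ and the symbols $\alpha$ and $D_m$ are illustrative notation rather than the paper's own:
\[
W^{(\ell)}_{P} \;=\; \sum_{m=1}^{M} \alpha^{(\ell)}_{P,m}\, D_m, \qquad P \in \{Q, K, V, O\}, \quad \ell = 1, \dots, L,
\]
where $\{D_m\}_{m=1}^{M}$ is the dictionary of matrix atoms shared across all $L$ layers and $\alpha^{(\ell)}_{P,m}$ are per-layer mixing coefficients trained with a standard optimizer; choosing $M$ much smaller than the $4L$ original projection matrices yields the attention-parameter reduction reported above.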