Large token-indexed lookup tables provide a compute-decoupled scaling path, but their practical gains are often limited by poor parameter efficiency and rapid memory growth. We attribute these limitations to Zipfian under-training of the long tail, heterogeneous demand across layers, and "slot collapse" that produces redundant embeddings. To address this, we propose X-GRAM, a frequency-aware dynamic token-injection framework. X-GRAM employs hybrid hashing and alias mixing to compress the tail while preserving head capacity, and refines retrieved vectors via a normalized SwiGLU ShortConv to extract diverse local n-gram features. These signals are integrated into attention value streams and inter-layer residuals using depth-aware gating, effectively aligning static memory with dynamic context. This design introduces a memory-centric scaling axis that decouples model capacity from FLOPs. Extensive evaluations at the 0.73B and 1.15B scales show that X-GRAM improves average accuracy by as much as 4.4 points over the vanilla backbone and 3.2 points over strong retrieval baselines, while using substantially smaller tables in its 50% configuration. Overall, by decoupling capacity from compute through efficient memory management, X-GRAM offers a scalable and practical paradigm for future memory-augmented architectures. Code is available at https://github.com/Longyichen/X-gram.
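To make the frequency-aware retrieval idea concrete, below is a minimal, hypothetical sketch of hybrid hashing with gated residual injection: frequent ("head") n-grams keep dedicated table slots, while the long tail is folded into a smaller shared hashed region, and retrieved vectors are added into the residual stream through a depth-dependent gate. All sizes, names, and the scalar gate are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Illustrative sketch (NOT the X-GRAM implementation): head n-grams get
# private slots; the tail shares a small hashed region, so rare n-grams
# co-train parameters instead of occupying under-trained private slots.
HEAD_SLOTS = 4      # dedicated slots for the most frequent n-grams
TAIL_SLOTS = 3      # shared hashed region for everything else
DIM = 8             # embedding width

rng = np.random.default_rng(0)
table = rng.standard_normal((HEAD_SLOTS + TAIL_SLOTS, DIM))

# frequency rank of each n-gram id (rank < HEAD_SLOTS => "head")
rank = {101: 0, 202: 1, 303: 2, 404: 3}

def slot(ngram_id: int) -> int:
    if ngram_id in rank:                                # head: private slot
        return rank[ngram_id]
    return HEAD_SLOTS + (ngram_id % TAIL_SLOTS)         # tail: hashed slot

def inject(hidden: np.ndarray, ngram_id: int, depth_gate: float) -> np.ndarray:
    """Add the retrieved vector into the residual stream, scaled by a
    depth-dependent gate (a scalar here; learned per layer in practice)."""
    return hidden + depth_gate * table[slot(ngram_id)]

h = inject(np.zeros(DIM), 999_999, depth_gate=0.5)      # tail n-gram
```

In this toy version the gate is a fixed scalar; the abstract's depth-aware gating would instead learn how strongly each layer draws on the static memory.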