Long-Context Modeling via GSS-Transformer Hybrid Architecture with Learnable Mixing

Modeling long-range dependencies remains a central challenge in natural language processing. Transformer architectures achieve strong performance via self-attention, but their cost scales quadratically ($O(N^2)$) with sequence length. State Space Models (SSMs) scale linearly ($O(N)$) but suffer from a selective recall bottleneck: they struggle to retrieve precise information from compressed states. The result is a fundamental tradeoff between efficiency and perplexity. To address this tradeoff, we propose the \textit{Parallel Hybrid Architecture (PHA)}, which runs Gated State Spaces (GSS), Grouped Query Attention (GQA), and Feed-Forward Networks (FFNs) as independent parallel branches fused by a learnable mixing mechanism. Instead of forcing SSMs to approximate attention or serializing the two paradigms, PHA lets each branch specialize: GSS captures global context, attention performs selective retrieval, and the FFN provides complementary processing. On WikiText-103, PHA achieves 16.51 PPL at 125M parameters, outperforming Hedgehog (16.70) and H3-125M (23.70). Scaling to 180M parameters yields 16.42 PPL, comparable to the pure attention baseline, while delivering 24\% higher throughput and up to 40\% lower memory usage at long contexts. On OpenWebText, our 125M model achieves 19.72 PPL, outperforming standard Transformers (20.60) and GSS hybrid baselines (19.80). These results demonstrate that separating sequence modeling paradigms into parallel specialists enables Transformer-level perplexity with substantially improved efficiency for long-context language modeling.
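
As a rough illustration of the parallel-branch idea, the sketch below fuses three branch outputs with softmax-normalized scalar weights. The abstract does not specify the form of the mixing mechanism, so the softmax parameterization, the function names (`softmax`, `mix_branches`), and the toy branch outputs are all assumptions made for illustration and are not the PHA implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_branches(branch_outputs, mix_logits):
    """Fuse parallel branch outputs into one vector.

    branch_outputs: list of B vectors (lists of floats) of equal length.
    mix_logits: list of B floats, standing in for learnable parameters.
    The softmax-weighted sum is an assumed form of "learnable mixing".
    """
    if len(branch_outputs) != len(mix_logits):
        raise ValueError("need one mixing logit per branch")
    weights = softmax(mix_logits)
    dim = len(branch_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, branch_outputs))
        for i in range(dim)
    ]


# Toy stand-ins for the outputs of the GSS, GQA, and FFN branches
# at a single token position (hypothetical values).
gss_out = [1.0, 0.0]
gqa_out = [0.0, 1.0]
ffn_out = [1.0, 1.0]

# Equal logits give each branch the same weight of 1/3.
fused = mix_branches([gss_out, gqa_out, ffn_out], [0.0, 0.0, 0.0])
```

Because every branch reads the same input and the fusion is a weighted sum, no branch depends on another's output, which is what distinguishes this layout from a serialized hybrid. In a trained model the logits would be updated by gradient descent, and the mixing could equally be per-channel or input-dependent.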
