Long-context language modeling is commonly framed as a problem of scaling token-level attention, yet existing approaches leave local-to-global information structuring largely implicit. Drawing on cognitive theories of discourse comprehension, we propose HiCI (Hierarchical Construction-Integration), a hierarchical attention module that constructs segment-level representations, integrates them into a shared global context, and broadcasts both to condition segment-level attention. We validate HiCI through parameter-efficient adaptation of LLaMA-2 with fewer than 5.5% additional parameters, extending the context window from 4K to 100K tokens (7B) and 64K tokens (13B). Across language modeling, retrieval, and instruction-following benchmarks, HiCI yields consistent improvements over strong baselines, matching proprietary models on topic retrieval and surpassing GPT-3.5-Turbo-16K on code comprehension. These results support explicit hierarchical structuring as an effective inductive bias for long-context modeling.
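The construct, integrate, and broadcast stages named in the abstract can be illustrated with a toy NumPy sketch. It is not the HiCI implementation. The abstract does not specify the operators, so everything here is an assumption made for illustration: mean pooling as the segment-level "construction", a mean over segment summaries as the global "integration", a single unparameterized attention head with no causal mask, and the function name `hierarchical_segment_attention`.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def hierarchical_segment_attention(x, segment_len):
    """Toy construct -> integrate -> broadcast pass (illustrative only).

    x: array of shape (seq_len, dim); seq_len must be divisible by segment_len.
    Returns an array of the same shape.
    """
    x = np.asarray(x, dtype=float)
    seq_len, dim = x.shape
    if seq_len % segment_len != 0:
        raise ValueError("seq_len must be divisible by segment_len")
    segments = x.reshape(seq_len // segment_len, segment_len, dim)

    # Construct: one summary vector per segment (mean pooling is an assumption).
    local = segments.mean(axis=1)                    # (n_seg, dim)

    # Integrate: merge segment summaries into one shared global vector
    # (a plain mean is an assumption; a learned module could go here).
    global_ctx = local.mean(axis=0, keepdims=True)   # (1, dim)

    out = np.empty_like(segments)
    for i, seg in enumerate(segments):
        # Broadcast: each segment attends over the global vector, its own
        # summary, and its own tokens. No causal mask in this sketch.
        kv = np.concatenate([global_ctx, local[i:i + 1], seg], axis=0)
        scores = seg @ kv.T / np.sqrt(dim)           # (segment_len, segment_len + 2)
        out[i] = softmax(scores) @ kv
    return out.reshape(seq_len, dim)


x = np.random.default_rng(0).normal(size=(8, 4))
y = hierarchical_segment_attention(x, segment_len=4)
print(y.shape)
```

In this sketch, attention cost per segment depends on `segment_len + 2` keys rather than on the full sequence length, which is the kind of saving a segment-level design is meant to provide; the actual module presumably uses learned projections and causal masking.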