Large language models (LLMs) excel across many tasks, yet inference is still dominated by strictly token-by-token autoregression. Existing acceleration methods largely patch this pipeline and miss core human-reading ingredients: content-adaptive foresight, chunk-structure-aware compute allocation, and train--test consistency for preview/skimming. We propose the \textbf{Fovea-Block-Skip Transformer} (FBS), which injects a causal, trainable loop into Transformers via a Parafovea-Attention Window (PAW), a Chunk-Head (CH), and a Skip-Gate (SG). Across diverse benchmarks, FBS improves the quality--efficiency trade-off without increasing parameters, and ablations show the three modules are complementary.