Test-time scaling has substantially improved the reasoning and agentic capabilities of Large Language Models (LLMs). Standard Transformers, however, scale inference-time compute inefficiently: conventional looping strategies incur high computational overhead, and their KV cache grows with model depth. We present Universal YOCO (YOCO-U), which combines the YOCO decoder-decoder architecture with recursive computation. Built on the YOCO framework, YOCO-U implements a Universal Self-Decoder that performs multiple iterations via parameter sharing, while confining the iterative process to shallow, efficient-attention layers. The combination yields a capability-efficiency tradeoff that neither YOCO nor recursion achieves on its own: the YOCO architecture provides a constant global KV cache and linear pre-filling, while partial recursion adds representational depth with limited overhead. As a result, YOCO-U improves token utility and scaling behavior while maintaining efficient inference. Empirical results show that YOCO-U remains highly competitive on general and long-context benchmarks, suggesting that integrating efficient-attention architectures with recursive computation is a promising direction for scalable LLMs.
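The control flow described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the names `toy_block`, `causal_cross_attention`, `yoco_u_forward`, and `num_loops` are hypothetical, the position-wise `tanh` block is only a stand-in for an efficient-attention layer, and the self-decoder output is used directly as the shared cache where a real model would use separate key and value projections. The sketch shows only two structural points: the shallow self-decoder stack is reused across iterations with shared weights, and a single global cache is built once and read by every cross-decoder layer.

```python
import numpy as np


def toy_block(h, w):
    # Hypothetical stand-in for one efficient-attention layer:
    # a position-wise residual update, so it is trivially causal.
    return h + np.tanh(h @ w)


def causal_cross_attention(q, kv):
    # Each position attends only to cache entries at or before itself.
    n, d = q.shape
    scores = (q @ kv.T) / np.sqrt(d)
    mask = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ kv


def yoco_u_forward(x, self_weights, cross_weights, num_loops):
    # Universal Self-Decoder (assumed form): the same shallow stack,
    # with the same weights, is applied num_loops times.
    h = x
    for _ in range(num_loops):
        for w in self_weights:
            h = toy_block(h, w)

    # Global cache built once from the self-decoder output. Its size
    # depends on sequence length only, not on num_loops or on the
    # number of cross-decoder layers.
    global_kv = h

    # Cross-decoder: every layer reads the same global cache.
    out = h
    for w in cross_weights:
        out = out + np.tanh(causal_cross_attention(out, global_kv) @ w)
    return out, global_kv
```

In this sketch, raising `num_loops` adds compute in the self-decoder only; the parameter count and the shape of `global_kv` stay the same, which is the property the abstract attributes to combining partial recursion with YOCO's single global cache.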