Autoregressive language models are conventionally defined over discrete token sequences, committing to a specific token at every generation step. This early discretization forces uncertainty to be resolved through token-level sampling, often leading to instability, repetition, and sensitivity to decoding heuristics. In this work, we introduce a continuous autoregressive formulation of language generation in which tokens are represented as continuous vectors that \emph{mature} over multiple update steps before being discretized. Rather than sampling tokens, the model evolves continuous token representations through a deterministic dynamical process, committing to a discrete token only when the representation has sufficiently converged. Discrete text is recovered via hard decoding, while uncertainty is maintained and resolved in the continuous space. We show that this maturation process alone is sufficient to produce coherent and diverse text using deterministic decoding (argmax), without reliance on token-level sampling, diffusion-style denoising, or auxiliary stabilization mechanisms. Additional perturbations, such as stochastic dynamics or history smoothing, can be incorporated naturally but are not required for the model to function. To our knowledge, this is the first autoregressive language model that generates text by evolving continuous token representations to convergence prior to discretization, enabling stable generation without token-level sampling.
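To make the maturation loop concrete, the following minimal Python sketch illustrates the decoding procedure the abstract describes: a continuous token state is evolved by a deterministic update until it converges, and only then is it discretized by argmax against an embedding table. The transition network (a stand-in GRU cell), the norm-based convergence test, the tolerance, and the nearest-embedding hard decoding are all illustrative assumptions, not the paper's actual architecture.

\begin{verbatim}
import torch
import torch.nn as nn

# Hypothetical setup: dimensions, embedding table, and transition
# network are illustrative stand-ins for the model described above.
torch.manual_seed(0)
d, vocab = 64, 100
embed = nn.Embedding(vocab, d)        # token embedding table
update_fn = nn.GRUCell(d, d)          # stand-in deterministic dynamics

def mature_and_commit(context, max_steps=32, tol=1e-3):
    """Evolve a continuous token state to convergence ("maturation"),
    returning the converged representation for hard decoding."""
    z = torch.zeros(1, d)             # nascent token representation
    for _ in range(max_steps):
        z_next = update_fn(context, z)       # deterministic update step
        if torch.norm(z_next - z) < tol:     # state has sufficiently converged
            return z_next
        z = z_next
    return z

context = torch.randn(1, d)           # summary of the committed prefix
z_star = mature_and_commit(context)
logits = z_star @ embed.weight.T      # similarity to each token embedding
token_id = int(torch.argmax(logits))  # hard decoding: commit one token, no sampling
\end{verbatim}

In this sketch, uncertainty lives in the trajectory of the state: it may pass near several token embeddings before settling, and only the converged state is discretized, which is why no token-level sampling is needed at commit time.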