We present a controlled empirical comparison of autoregressive (AR) and masked diffusion (MDLM) language models. Both models are trained on identical data (50M tokens from TinyStories), with an identical compute budget (20,000 steps, batch size 32, sequence length 512), and on identical hardware (NVIDIA H100 80GB), isolating the generation paradigm as the sole variable. We report three findings. First, both paradigms achieve comparable training throughput (~50K tokens/second), with MDLM requiring only 4.7% more wall-clock time. Second, AR converges faster and begins overfitting by step 14,000, whereas MDLM converges more slowly and is still improving at step 20,000, suggesting that the two paradigms have different compute-optimal training regimes. Third, a quantitative diversity analysis of 1,000 generated samples reveals a structural diversity-fluency trade-off: AR produces fluent but repetitive outputs (99.8% begin with the same word), while MDLM generates more diverse narratives (93.4% unique 5-word openings, higher Distinct-n, lower Self-BLEU) at the cost of occasional grammatical inconsistencies. All code, trained checkpoints, and data pipelines are released for reproducibility.
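For readers unfamiliar with the diversity metrics named above, the following is a minimal sketch of how Distinct-n, the fraction of unique k-word openings, and the share of the most common first word could be computed. It assumes whitespace tokenization and corpus-level pooling of n-grams; the function names and the toy samples are hypothetical and are not taken from the released code, whose exact definitions may differ. Self-BLEU is omitted here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(samples, n):
    """Unique n-grams divided by total n-grams, pooled over all samples.

    Assumption: whitespace tokenization and corpus-level pooling.
    """
    pooled = [g for s in samples for g in ngrams(s.split(), n)]
    if not pooled:
        return 0.0
    return len(set(pooled)) / len(pooled)


def unique_opening_fraction(samples, k=5):
    """Fraction of samples whose first k words form a distinct opening."""
    openings = [tuple(s.split()[:k]) for s in samples]
    if not openings:
        return 0.0
    return len(set(openings)) / len(openings)


def top_first_word_share(samples):
    """Share of non-empty samples that begin with the most common first word."""
    first_words = [s.split()[0] for s in samples if s.split()]
    if not first_words:
        return 0.0
    return Counter(first_words).most_common(1)[0][1] / len(first_words)


# Toy inputs for illustration only; these are not outputs of either model.
toy_samples = [
    "Once upon a time there was a cat",
    "Once upon a time there was a dog",
    "Lily found a red ball in the park",
]

if __name__ == "__main__":
    print("Distinct-1:", distinct_n(toy_samples, 1))
    print("Distinct-2:", distinct_n(toy_samples, 2))
    print("Unique 5-word openings:", unique_opening_fraction(toy_samples, k=5))
    print("Top first-word share:", top_first_word_share(toy_samples))
```

Higher Distinct-n and a higher unique-opening fraction indicate more varied outputs, while a top first-word share close to 1 indicates that nearly all samples start the same way.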