A Quantitative Experimental Repeated-Measures Study of Training Dynamics in a Small Llama-Style Language Model Under a Compute-Aware Token Budget

This study examines training dynamics in a small Llama-style language model trained under a fixed, compute-constrained token budget. Rather than evaluating efficiency solely through endpoint performance, the study uses a quantitative, experimental repeated-measures design to analyze how validation loss, validation perplexity, rolling volatility, backslide behavior, spike behavior, and between-seed variability change across token-based training intervals. Six independent training runs were conducted on a 4.26-million-parameter model using the TinyStories corpus, CPU-based full-precision training, and a target budget of approximately 20 million cumulative training tokens. Metrics were collected at 21 intervals, producing 126 seed-by-interval observations. Repeated-measures ANOVA showed statistically significant interval effects for validation loss, validation perplexity, and rolling volatility. Descriptive trajectories revealed rapid early improvement followed by non-monotonic degradation during later training intervals. Mean validation loss decreased from 8.3552 at initialization to 2.7996 near 4 million tokens, then increased to 3.9010 by the final checkpoint. Validation perplexity followed the same pattern, falling sharply early in training and rising later. Derived telemetry further showed recurrent validation-loss backslides, and the interval summaries provided no evidence of a stable phase under the predefined criteria. These findings suggest that compute-aware language model evaluation should examine training trajectories rather than endpoint metrics alone. In constrained compute settings, additional token exposure may increase computational cost without producing proportional generalization gains, and interval-level telemetry can reveal instability, regression, and diminishing returns that final metrics may obscure.
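
The interval-level quantities named above can be illustrated with a minimal Python sketch. It is not the study's implementation: the function names, the backslide tolerance `tol`, and the rolling `window` are hypothetical, and the definitions used here (a backslide as any interval-to-interval increase in validation loss, volatility as a rolling sample standard deviation) are assumptions, since the abstract does not state the predefined criteria. Perplexity is taken as the exponential of the mean cross-entropy loss, which is the standard relationship; note that the exponential of a seed-averaged loss generally differs from the average of per-seed perplexities.

```python
import math
import statistics


def perplexity(loss):
    """Perplexity as exp(mean cross-entropy loss in nats)."""
    return math.exp(loss)


def backslides(losses, tol=0.0):
    """Indices i where validation loss rose by more than `tol`
    relative to the previous interval (assumed definition)."""
    return [i for i in range(1, len(losses)) if losses[i] - losses[i - 1] > tol]


def rolling_volatility(losses, window=3):
    """Sample standard deviation over the trailing `window` losses.
    Returns None for intervals with fewer than `window` points.
    The window length is a hypothetical choice."""
    out = []
    for i in range(len(losses)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(statistics.stdev(losses[i + 1 - window : i + 1]))
    return out


# The three mean validation losses quoted in the abstract:
# initialization, near 4 million tokens, and the final checkpoint.
# These are NOT consecutive intervals; they only illustrate the shape.
reported = [8.3552, 2.7996, 3.9010]

reported_ppl = [perplexity(x) for x in reported]
reported_backslides = backslides(reported)

if __name__ == "__main__":
    for loss, ppl in zip(reported, reported_ppl):
        print(f"loss={loss:.4f}  perplexity={ppl:.2f}")
    print("backslide at index:", reported_backslides)
```

Applied to a full 21-interval trajectory for each seed, the same functions would yield per-seed backslide counts and volatility series that could then be summarized across seeds.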
