Background: Recursive reasoning models achieve strong performance through iterative refinement, allowing small networks to match large language models. However, training is computationally expensive, often requiring 36 GPU-hours on Sudoku-Extreme. Existing models use a fixed recursion depth and uniform supervision weighting, leading to inefficient training.

Objectives: We propose CGAR (Curriculum-Guided Adaptive Recursion), which applies curriculum learning to architectural depth. CGAR introduces a Progressive Depth Curriculum (PDC) to dynamically adjust recursion depth during training and Hierarchical Supervision Weighting (HSW) to apply exponentially decaying importance to supervision steps.

Methods: PDC implements a three-stage schedule that transitions from a shallow (2, 1) configuration to the full-depth (6, 3) configuration, yielding a 41.4% reduction in training FLOPs. HSW applies exponential decay across supervision steps, achieving a 40% reduction in gradient variance and accelerating convergence.

Results: On Sudoku-Extreme, CGAR achieves a 1.71x training speedup (10.93 to 6.38 hours) with only a 0.63-point accuracy drop (86.65% to 86.02%). PDC alone achieves a 2.26x speedup at 85.47% accuracy, a Pareto improvement in efficiency and quality. HSW alone provides a 1.61x speedup. CGAR-trained models also show superior inference efficiency, with 100% halting accuracy and 11% fewer reasoning steps.

Conclusions: CGAR enables efficient training of recursive reasoning models on modest hardware. By treating depth as a scheduled parameter, it achieves substantial savings and mitigates overfitting, making these models practical for neurosymbolic AI and program synthesis. Code and models: https://github.com/Kaleemullahqasim/CGAR and huggingface.co/Kaleemullah/trm-cgar-sudoku.
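The two components can be sketched as follows. This is a minimal illustration, not the paper's implementation: the abstract fixes only the endpoint depths (2, 1) and (6, 3) and the fact that the schedule has three stages, so the stage boundaries, the intermediate (4, 2) depth, and the decay rate `gamma` in the HSW sketch are assumptions.

```python
def pdc_depth(progress: float) -> tuple[int, int]:
    """Progressive Depth Curriculum (PDC): map training progress in [0, 1]
    to a (high-level, low-level) recursion-depth pair.

    The shallow (2, 1) and full (6, 3) depths come from the abstract;
    the equal-thirds boundaries and the (4, 2) midpoint are assumed.
    """
    if progress < 1 / 3:       # stage 1: shallow recursion
        return (2, 1)
    elif progress < 2 / 3:     # stage 2: intermediate depth (assumed)
        return (4, 2)
    else:                      # stage 3: full depth
        return (6, 3)


def hsw_weights(num_steps: int, gamma: float = 0.5) -> list[float]:
    """Hierarchical Supervision Weighting (HSW): exponentially decaying
    importance over deep-supervision steps, normalized to sum to 1.

    gamma is an illustrative hyperparameter, not a value from the paper.
    """
    raw = [gamma ** t for t in range(num_steps)]
    total = sum(raw)
    return [w / total for w in raw]
```

For example, `pdc_depth(0.1)` returns the shallow `(2, 1)` configuration early in training, while `hsw_weights(4)` gives monotonically decreasing weights so that later supervision steps contribute less to the loss, which is one way the reported gradient-variance reduction could arise.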