Reinforcement learning with verifiable rewards (RLVR) has driven significant advances in Large Reasoning Models (LRMs), but the paradigm is fundamentally limited in specialized or novel domains where verifiable supervision is prohibitively expensive or unavailable, which makes test-time adaptation a key challenge. Existing test-time methods offer a potential solution, yet they learn from a static set of queries and therefore risk overfitting to surface textual patterns. To address this gap, we introduce Test-Time Variational Synthesis (TTVS), a novel framework that enables LRMs to self-evolve by dynamically augmenting the training stream derived from unlabeled test queries. TTVS comprises two synergistic modules: (1) Online Variational Synthesis, which transforms static test queries into a dynamic stream of diverse, semantically equivalent variations, forcing the model to learn the underlying problem logic rather than superficial patterns; and (2) Test-time Hybrid Exploration, which balances accuracy-driven exploitation with consistency-driven exploration across the synthetic variants. Extensive experiments show that TTVS yields superior performance across eight model architectures. Notably, using only unlabeled test-time data, TTVS not only surpasses other test-time adaptation methods but also outperforms state-of-the-art supervised RL-based techniques trained on large amounts of high-quality labeled data.
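To make the interplay between the two reward signals concrete, the following is a minimal illustrative sketch, not the TTVS implementation. The abstract does not specify the reward definitions, so everything here is an assumption: the function names `majority_label` and `hybrid_rewards`, the `mix` weight, the use of majority voting over sampled answers as a label-free "accuracy" proxy, and the use of agreement with the answer pooled across variants as the "consistency" signal.

```python
from collections import Counter
from typing import List


def majority_label(answers: List[str]) -> str:
    """Return the most frequent answer (a label-free pseudo-label)."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def hybrid_rewards(
    original_answers: List[str],
    variant_answers: List[List[str]],
    mix: float = 0.5,
) -> List[float]:
    """Score each answer sampled for the original query.

    Hypothetical reward, assumed for illustration only:
      exploit = agreement with the majority answer on the original query
      explore = agreement with the majority answer pooled over the
                original query and all of its synthetic variants
    `mix` in [0, 1] weights the consistency term against the accuracy term.
    """
    pseudo = majority_label(original_answers)
    pooled = original_answers + [a for group in variant_answers for a in group]
    consensus = majority_label(pooled)

    rewards = []
    for answer in original_answers:
        exploit = 1.0 if answer == pseudo else 0.0
        explore = 1.0 if answer == consensus else 0.0
        rewards.append((1.0 - mix) * exploit + mix * explore)
    return rewards


if __name__ == "__main__":
    # Answers sampled for one test query and for two paraphrased variants.
    original = ["4", "4", "5"]
    variants = [["5", "5", "5"], ["5", "4", "5"]]
    print(hybrid_rewards(original, variants, mix=0.25))
```

In this toy case the majority answer on the original query ("4") disagrees with the answer favored once the variants are pooled ("5"), so a nonzero `mix` keeps some reward on the minority answer instead of reinforcing only the pattern tied to the original wording.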