Prompting Test-Time Scaling Is A Strong LLM Reasoning Data Augmentation

Large language models (LLMs) have demonstrated impressive reasoning capabilities when provided with chain-of-thought exemplars, but curating large reasoning datasets remains laborious and resource-intensive. In this work, we introduce Prompting Test-Time Scaling (P-TTS), a simple yet effective inference-time data augmentation strategy for enhancing LLM reasoning through fine-tuning. Rather than collecting thousands or even millions of examples, P-TTS starts from a small pool of only 90 manually selected reasoning instances and systematically varies the intensity of principled instruction prompts at test time to synthesize diverse reasoning trajectory contexts. We then fine-tune Qwen-2.5 models of various sizes on the P-TTS data. Across a suite of mathematical reasoning benchmarks (AIME 2024 and 2025, MATH500) and GPQA-Diamond, our P-TTS-7B and P-TTS-32B models outperform prior competitive baselines such as S1 and S1.1 (1K-shot). Relative to S1 and S1.1, respectively, P-TTS-7B achieves absolute accuracy gains of +26.66% and +30.00% on AIME'24 and +13.34% and +6.67% on AIME'25; P-TTS-32B yields gains of +23.33% and +16.63% on AIME'24 and +26.63% and +3.33% on AIME'25, with comparable or better performance on MATH500 and GPQA-Diamond. We further show that P-TTS improves zero-shot generalization accuracy on the out-of-domain reasoning benchmarks Gaokao, Kaoyan, OlympiadBench, AMC23, GradeSchoolMath, and Minerva. Our analysis suggests that test-time scaling effectively explores the latent space of reasoning patterns, amplifying LLM problem-solving with minimal annotation overhead and further unlocking the reasoning potential of LLMs. P-TTS offers a practical, low-cost way to elicit LLM reasoning in resource-constrained or rapidly evolving domains.
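To make the augmentation idea concrete, the following is a minimal sketch of how a small seed pool could be expanded by pairing each question with instruction prompts of increasing intensity. It is an illustration under stated assumptions, not the P-TTS implementation: the template names (`step_by_step`, `stakes`), their wording, the three intensity levels, and the `augment` helper are all hypothetical, and the step that sends each prompt to a model to collect a reasoning trajectory is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AugmentedPrompt:
    seed_id: int      # index of the seed question in the pool
    framing: str      # which instruction family was applied
    intensity: int    # 0 = unmodified seed, higher = stronger instruction
    text: str         # the prompt that would be sent to a model


# Hypothetical instruction families, each ordered from weak to strong.
# The abstract does not list the actual instructions; these are placeholders.
INSTRUCTION_TEMPLATES = {
    "step_by_step": [
        "Think step by step.",
        "Think step by step and check each step.",
        "Think step by step, check each step, and verify the final answer.",
    ],
    "stakes": [
        "Accuracy matters.",
        "Accuracy matters a great deal here.",
        "Accuracy is critical; an incorrect answer is unacceptable.",
    ],
}


def augment(seed_questions, templates=INSTRUCTION_TEMPLATES, include_original=True):
    """Expand each seed question into one prompt per (framing, intensity) pair."""
    prompts = []
    for seed_id, question in enumerate(seed_questions):
        if include_original:
            prompts.append(AugmentedPrompt(seed_id, "original", 0, question))
        for framing, levels in templates.items():
            for intensity, instruction in enumerate(levels, start=1):
                text = f"{question}\n\n{instruction}"
                prompts.append(AugmentedPrompt(seed_id, framing, intensity, text))
    return prompts


if __name__ == "__main__":
    seeds = [f"Question {i}" for i in range(90)]  # stand-in for the 90 seed instances
    prompts = augment(seeds)
    print(len(prompts), "prompts from", len(seeds), "seeds")
```

With these placeholder templates, each seed yields seven prompts (the original plus two families at three levels). That count is an artifact of this sketch and says nothing about the size of the actual P-TTS dataset.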
