How can "weak teacher models," such as average human annotators or existing AI systems, effectively supervise LLMs to improve performance on hard reasoning tasks, especially tasks that challenge the teachers themselves and require expertise or daily practice? In this paper, we seek empirical answers to this question by investigating various data-driven strategies that offer supervision data of different quality levels on tasks of varying complexity. Two intuitive strategies emerge for teacher models to provide supervision during alignment training: 1) using lower-quality supervision on complete tasks that match the difficulty of the target reasoning tasks, and 2) leveraging higher-quality supervision on easier, less challenging subtasks. Interestingly, we find that even when the outcome error rate of hard-task supervision is high (e.g., 90\%), training on such data can outperform perfectly correct supervision on easier subtasks across multiple hard math benchmarks. We further identify a more critical factor influencing training performance: the step-wise error rate, which indicates the severity of errors in solutions. Specifically, training on hard-task supervision with the same outcome error rate but disparate step-wise error rates can lead to a 30\% accuracy gap on the MATH benchmark. Our results also reveal that supplementing hard-task supervision with the corresponding subtask supervision yields notably better performance than simply adding rephrased hard full-task supervision, suggesting new avenues for data augmentation. Data and code are released at \url{https://github.com/hexuan21/Weak-to-Strong}.
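The two supervision-quality metrics contrasted above can be made concrete with a minimal sketch. This is an illustrative assumption, not the paper's released code: the function names and the per-solution data layout (a boolean for the final answer plus a list of per-step correctness flags) are hypothetical.

```python
# Hypothetical sketch of the two metrics (not the paper's implementation):
# - outcome error rate: fraction of solutions whose final answer is wrong
# - step-wise error rate: fraction of incorrect intermediate steps,
#   averaged over solutions

def outcome_error_rate(solutions):
    """solutions: list of dicts with a 'final_correct' boolean."""
    return sum(not s["final_correct"] for s in solutions) / len(solutions)

def stepwise_error_rate(solutions):
    """Each solution carries 'steps': list of booleans (True = step correct)."""
    per_solution = [
        sum(not ok for ok in s["steps"]) / len(s["steps"]) for s in solutions
    ]
    return sum(per_solution) / len(per_solution)

# Two supervision sets with the SAME outcome error rate (final answer wrong
# in both) but very different step-wise error rates:
mild = [{"final_correct": False, "steps": [True, True, True, False]}]
severe = [{"final_correct": False, "steps": [False, False, False, False]}]

print(outcome_error_rate(mild), outcome_error_rate(severe))    # 1.0 1.0
print(stepwise_error_rate(mild), stepwise_error_rate(severe))  # 0.25 1.0
```

The example shows why the outcome error rate alone is a coarse signal: both sets look identical by that measure, yet one contains mostly sound reasoning with a single faulty step while the other is wrong throughout.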