While Large Language Models (LLMs) have demonstrated proficiency in handling complex queries, much prior work has depended on datasets extensively annotated by human experts. This reliance on fully supervised annotation poses scalability challenges, particularly as models and data requirements grow. To mitigate this, we explore the potential of enhancing LLMs' reasoning abilities with minimal human supervision. In this work, we introduce self-reinforcement, which begins with Supervised Fine-Tuning (SFT) of the model on a small collection of annotated questions. It then iteratively improves the LLM by learning from the differences between the responses of the SFT and unfinetuned models on unlabeled questions. Our method is efficient and does not rely heavily on extensive human-annotated explanations. However, current reasoning benchmarks typically include only golden-reference answers or rationales. Therefore, we present \textsc{PuzzleBen}, a weakly supervised benchmark that comprises 25,147 complex questions, answers, and human-generated rationales across various domains, such as brainteasers, puzzles, riddles, parajumbles, and critical reasoning tasks. A unique aspect of our dataset is the inclusion of 10,000 unannotated questions, enabling us to explore using less supervised data to boost LLMs' inference capabilities. Our experiments underscore the significance of \textsc{PuzzleBen}, as well as the effectiveness of our methodology as a promising direction for future work. Our dataset and code will be published soon on \texttt{Anonymity Link}.
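The self-reinforcement loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not our released implementation: models are represented as plain callables from a question to a response, the names `build_preference_pairs`, `self_reinforce`, and `update` are hypothetical, and we assume that the SFT model's response is treated as the preferred one and that the previous model serves as the weaker reference in the next round.

```python
from typing import Callable, Dict, List

# A "model" here is any callable mapping a question to a response (an assumption
# made for illustration; a real system would wrap an LLM generation call).
Responder = Callable[[str], str]
Pair = Dict[str, str]


def build_preference_pairs(
    questions: List[str], stronger: Responder, weaker: Responder
) -> List[Pair]:
    """Collect response differences between two models on unlabeled questions."""
    pairs: List[Pair] = []
    for question in questions:
        chosen = stronger(question)
        rejected = weaker(question)
        # Identical responses carry no signal about what to prefer, so skip them.
        if chosen != rejected:
            pairs.append({"prompt": question, "chosen": chosen, "rejected": rejected})
    return pairs


def self_reinforce(
    questions: List[str],
    sft_model: Responder,
    base_model: Responder,
    update: Callable[[Responder, List[Pair]], Responder],
    rounds: int = 2,
) -> Responder:
    """Iteratively improve a model from response differences (hypothetical loop)."""
    stronger, weaker = sft_model, base_model
    for _ in range(rounds):
        pairs = build_preference_pairs(questions, stronger, weaker)
        if not pairs:
            break  # the two models agree everywhere; nothing left to learn from
        # `update` stands in for a preference-learning step supplied by the caller.
        improved = update(stronger, pairs)
        # Assumption: the previous model becomes the weaker reference next round.
        stronger, weaker = improved, stronger
    return stronger
```

The `update` argument is deliberately left abstract so that the sketch does not commit to a specific preference-optimization objective; any step that consumes prompt/chosen/rejected triples could be plugged in.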