Optimizing Language Model's Reasoning Abilities with Weak Supervision

While Large Language Models (LLMs) have demonstrated proficiency in handling complex queries, much of the past work has depended on datasets extensively annotated by human experts. This reliance on fully supervised annotations poses scalability challenges, particularly as models and data requirements grow. To mitigate this, we explore the potential of enhancing LLMs' reasoning abilities with minimal human supervision. In this work, we introduce self-reinforcement, which begins with Supervised Fine-Tuning (SFT) of the model on a small collection of annotated questions. It then iteratively improves the LLM by learning from the differences between the responses of the SFT and unfinetuned models on unlabeled questions. Our approach is efficient and does not rely heavily on extensive human-annotated explanations. However, current reasoning benchmarks typically only include golden-reference answers or rationales. Therefore, we present \textsc{PuzzleBen}, a weakly supervised benchmark that comprises 25,147 complex questions, answers, and human-generated rationales across various domains, such as brainteasers, puzzles, riddles, parajumbles, and critical reasoning tasks. A unique aspect of our dataset is the inclusion of 10,000 unannotated questions, which lets us explore how to boost LLMs' inference capabilities with less supervised data. Our experiments underscore the significance of \textsc{PuzzleBen} and the effectiveness of our methodology as a promising direction for future work. Our dataset and code will be published soon on \texttt{Anonymity Link}.
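
The self-reinforcement loop described in the abstract can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not say how the "differences in responses" are turned into a training signal, so the sketch assumes that the SFT model's response is treated as preferred over the unfinetuned model's response, and that after each round the improved model takes the role of the stronger model. The names `build_preference_pairs`, `self_reinforce`, and the `update` callback are hypothetical, and models are represented as plain functions from a question to a response.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for an LLM: maps a question string to a response string.
Model = Callable[[str], str]
Pair = Dict[str, str]


def build_preference_pairs(
    questions: List[str], stronger: Model, weaker: Model
) -> List[Pair]:
    """Pair the two models' responses on each unlabeled question.

    Assumption (not stated in the abstract): the stronger model's response
    is treated as preferred. Questions on which both models give the same
    response carry no difference signal and are skipped.
    """
    pairs: List[Pair] = []
    for question in questions:
        chosen, rejected = stronger(question), weaker(question)
        if chosen != rejected:
            pairs.append(
                {"prompt": question, "chosen": chosen, "rejected": rejected}
            )
    return pairs


def self_reinforce(
    base: Model,
    sft: Model,
    unlabeled: List[str],
    update: Callable[[Model, List[Pair]], Model],
    rounds: int = 2,
) -> Model:
    """Iteratively improve a model from response differences.

    `update` is a hypothetical placeholder for whatever training step
    consumes the pairs and returns an improved model.
    """
    weaker, stronger = base, sft
    for _ in range(rounds):
        pairs = build_preference_pairs(unlabeled, stronger, weaker)
        if not pairs:
            break  # the two models agree everywhere; nothing left to learn
        improved = update(stronger, pairs)
        # Assumption: the improved model becomes the new stronger model.
        weaker, stronger = stronger, improved
    return stronger
```

In practice `update` would wrap an actual fine-tuning step; here it is left abstract so that the control flow of the loop stays visible.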

