Existing efforts to improve the logical reasoning ability of language models have predominantly relied on supervised fine-tuning, which hinders generalization to new domains or tasks. The development of Large Language Models (LLMs) has demonstrated the capacity to compress abundant knowledge into a single proxy, enabling them to tackle multiple tasks effectively. Our preliminary experiments, nevertheless, show that LLMs perform poorly on logical reasoning: their performance on logical reasoning benchmarks is far behind the existing state-of-the-art baselines. In this paper, we make the first attempt to investigate the feasibility of incorporating logical knowledge through self-supervised post-training and activating it via in-context learning, an approach we term LogicLLM. Specifically, we devise an auto-regressive objective variant of MERIt and integrate it with two LLM series, FLAN-T5 and LLaMA, with parameter sizes ranging from 3 billion to 13 billion. The results on two challenging logical reasoning benchmarks demonstrate the effectiveness of LogicLLM. In addition, we conduct extensive ablation studies to analyze the key factors in designing logic-oriented proxy tasks.