Large language models (LLMs) have demonstrated impressive performance on reasoning tasks, which can be further improved through few-shot prompting techniques. However, current evaluations primarily focus on carefully constructed benchmarks and neglect real-world reasoning problems that contain missing or contradictory conditions, known as ill-defined problems. We observe that existing few-shot prompting techniques are ineffective in such scenarios, often giving overconfident answers or hallucinating. To study this problem further, we develop a benchmark called Problems with Missing and Contradictory conditions (PMC) and introduce two novel metrics to evaluate the performance of few-shot prompting methods in these scenarios. Our analysis on the PMC benchmark reveals a trade-off dilemma between mathematical-reasoning performance on well-defined problems and the ability to recognize ill-defined ones. To address the challenges posed by PMC, we propose a novel few-shot prompting method called SMT-LIB Prompting (SLP), which uses the SMT-LIB language to model problems rather than solve them directly. A double-check solving strategy then verifies the satisfiability and uniqueness of the solution and provides the final feedback. Extensive experiments demonstrate the superiority of our SLP approach over existing few-shot prompting methods on problems with missing and contradictory conditions. We will open-source our benchmark and code to facilitate future research.
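To make the SLP idea concrete, the following is a minimal illustrative sketch (not the paper's actual prompt format; the word problem and constants are hypothetical) of modeling a problem in SMT-LIB and applying a double-check of satisfiability and uniqueness with a solver such as z3. The problem modeled is: "Alice has 3 apples and buys 2 more. How many does she have?"

(set-option :produce-models true)
(set-logic QF_LIA)
; Declare one variable per quantity in the word problem.
(declare-const initial Int)
(declare-const bought Int)
(declare-const total Int)
; Encode each stated condition as an assertion.
(assert (= initial 3))
(assert (= bought 2))
(assert (= total (+ initial bought)))
(check-sat)                 ; sat: the conditions are mutually consistent
(get-value (total))         ; ((total 5))
; Double-check uniqueness: can total take any value other than 5?
(assert (distinct total 5))
(check-sat)                 ; unsat: 5 is the unique answer

Under this modeling, the two failure modes of ill-defined problems become solver verdicts: contradictory conditions make the first (check-sat) return unsat, while a missing condition (e.g., dropping the assertion fixing bought) leaves the problem underdetermined, so the uniqueness check returns sat instead of unsat.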