Recent developments in large language models (LLMs) have shown promise in enhancing the capabilities of natural language processing (NLP). Despite this progress, little research has examined how well LLMs themselves solve NLP problems. To fill this gap, we present a unique benchmarking dataset, NLPBench, comprising 378 college-level NLP questions that span various NLP topics and are sourced from Yale University's prior final exams. NLPBench includes questions with context, in which multiple sub-questions share the same public information, and diverse question types, including multiple choice, short answer, and math. Our evaluation, centered on LLMs such as GPT-3.5/4, PaLM-2, and LLAMA-2, incorporates advanced prompting strategies such as chain-of-thought (CoT) and tree-of-thought (ToT). Our study reveals that the effectiveness of these advanced prompting strategies can be inconsistent, occasionally damaging LLM performance, especially in smaller models such as LLAMA-2 (13b). Furthermore, our manual assessment revealed specific shortcomings in LLMs' scientific problem-solving skills, with weaknesses in logical decomposition and reasoning notably affecting results.