Recent developments in large language models (LLMs) have shown promise in enhancing the capabilities of natural language processing (NLP). Despite these successes, little research has been dedicated to the NLP problem-solving abilities of LLMs. To fill this gap, we present NLPBench, a unique benchmark dataset comprising 378 college-level questions that span various NLP topics, sourced from Yale University's prior final exams. NLPBench includes questions with context, in which multiple sub-questions share the same public information, and diverse question types, including multiple choice, short answer, and math. Our evaluation covers LLMs such as GPT-3.5/4, PaLM-2, and LLAMA-2, and incorporates advanced prompting strategies such as chain-of-thought (CoT) and tree-of-thought (ToT). Our study reveals that the effectiveness of these advanced prompting strategies can be inconsistent, occasionally damaging LLM performance, especially in smaller models such as LLAMA-2 (13b). Furthermore, our manual assessment reveals specific shortcomings in LLMs' scientific problem-solving skills, with weaknesses in logical decomposition and reasoning notably affecting results.