Assessing how effectively large language models (LLMs) perform different tasks is crucial for understanding their strengths and weaknesses. This paper presents the Hierarchical Prompting Taxonomy (HPT), which is grounded in human cognitive principles and designed to assess LLMs by examining the cognitive demands of various tasks. The HPT uses the Hierarchical Prompting Framework (HPF), which arranges five unique prompting strategies in a hierarchy according to the cognitive demands they place on LLMs relative to human mental capabilities. It assesses task complexity with the Hierarchical Prompting Index (HPI), which reflects the cognitive competencies of LLMs across diverse datasets and offers insights into the cognitive demands that datasets place on different LLMs. This approach enables a comprehensive evaluation of an LLM's problem-solving abilities and the intricacy of a dataset, offering a standardized metric for task complexity. Extensive experiments with multiple datasets and LLMs show that HPF enhances LLM performance by 2% to 63% compared to baseline performance. GSM8k is the most cognitively complex task among the reasoning and coding tasks, with an average HPI of 3.20. These results confirm the effectiveness of HPT. To support future research and reproducibility in this domain, the implementations of HPT and HPF are publicly available.
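The abstract does not spell out how the HPI is computed, so the following is only a minimal sketch of one plausible reading: the prompting strategies are tried in order of increasing cognitive demand, a task's score is the level of the first strategy that solves it, and a dataset's HPI is the mean of those scores. The names `hpi_for_task`, `dataset_hpi`, `make_toy_strategy`, and the `penalty` parameter are hypothetical and are not taken from the released HPT/HPF implementation.

```python
from statistics import mean
from typing import Callable, Iterable, Sequence

# Hypothetical type: a strategy takes a task and reports whether the model solved it.
Strategy = Callable[[str], bool]


def hpi_for_task(task: str, strategies: Sequence[Strategy], penalty: float = 0.0) -> float:
    """Return the 1-based level of the first strategy that solves the task.

    Assumption: if no strategy succeeds, the score is the number of
    levels plus `penalty`.
    """
    for level, solve in enumerate(strategies, start=1):
        if solve(task):
            return float(level)
    return len(strategies) + penalty


def dataset_hpi(tasks: Iterable[str], strategies: Sequence[Strategy], penalty: float = 0.0) -> float:
    """Average the per-task scores over a dataset (assumed aggregation)."""
    return mean(hpi_for_task(task, strategies, penalty) for task in tasks)


def make_toy_strategy(level: int) -> Strategy:
    """Toy stand-in for a real prompting strategy: 'solves' tasks whose
    text length is at most `level`. A real strategy would prompt an LLM
    and score its answer."""
    return lambda task: len(task) <= level


# Five levels, mirroring the five prompting strategies named in the abstract.
toy_strategies = [make_toy_strategy(k) for k in range(1, 6)]
toy_tasks = ["a", "bbb", "ccccc"]

print(dataset_hpi(toy_tasks, toy_strategies))  # 3.0
```

Under this reading, a higher dataset-level value means the model needed more demanding strategies on average, which is consistent with the abstract's use of the average HPI as a measure of task complexity.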