Assessing the effectiveness of large language models (LLMs) in performing different tasks is crucial for understanding their strengths and weaknesses. This paper presents the Hierarchical Prompting Taxonomy (HPT), grounded in human cognitive principles and designed to assess LLMs by examining the cognitive demands of various tasks. The HPT uses the Hierarchical Prompting Framework (HPF), a prompt selection framework that organizes five distinct prompting strategies by the cognitive load they place on LLMs. The study also introduces the Hierarchical Prompting Index (HPI), which quantifies task complexity, reflects an LLM's abilities across different datasets, and serves as a universal metric of task difficulty. The HPT offers a reliable method for evaluating LLMs' problem-solving skills in diverse scenarios, leading to clearer conclusions. Extensive experiments with multiple datasets and LLMs show that the HPF improves LLM performance by 2\% to 63\% compared to baseline performance on standard benchmark datasets, confirming the effectiveness of the HPT. To support future research in this domain, the implementations of the HPT and HPF are publicly available.
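The core mechanism described above, trying prompting strategies in order of increasing cognitive load and scoring a task by the level at which the model first succeeds, can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact implementation: the strategy names, the exact-match success criterion, and the averaging rule for the index are assumptions made for the example.

```python
# Hypothetical sketch of a hierarchical prompt-selection loop in the spirit
# of the HPF: strategies are ordered from lowest to highest cognitive load,
# and a task's level is the first strategy that yields a correct answer.
# Strategy names below are illustrative placeholders.

STRATEGIES = [
    "role prompting",                 # level 1: lowest cognitive load
    "zero-shot chain-of-thought",
    "few-shot chain-of-thought",
    "least-to-most prompting",
    "generated-knowledge prompting",  # level 5: highest cognitive load
]

def hierarchical_level(answer_fn, question, reference):
    """Return the 1-based level of the first strategy whose answer matches
    the reference, or len(STRATEGIES) + 1 if every strategy fails.
    `answer_fn(strategy, question)` stands in for an LLM call."""
    for level, strategy in enumerate(STRATEGIES, start=1):
        if answer_fn(strategy, question) == reference:
            return level
    return len(STRATEGIES) + 1

def hierarchical_prompting_index(levels):
    """Average level over a dataset: a simple per-dataset complexity score
    (one plausible reading of an HPI-style metric)."""
    return sum(levels) / len(levels)
```

With a mock `answer_fn` that only succeeds from the second strategy onward, `hierarchical_level` returns 2; averaging such levels over a dataset gives a single complexity score per model-dataset pair.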