Assessing the effectiveness of large language models (LLMs) across diverse tasks is essential for understanding their strengths and weaknesses. Conventional evaluation techniques typically apply a single prompting strategy uniformly across datasets, without accounting for varying degrees of task complexity. We introduce the Hierarchical Prompting Taxonomy (HPT), a taxonomy that employs a Hierarchical Prompt Framework (HPF) composed of five distinct prompting strategies, arranged from simplest to most complex, to assess LLMs more precisely and to offer a clearer perspective. Based on its rules, the taxonomy assigns a score, called the Hierarchical Prompting Score (HP-Score), to both datasets and LLMs, providing a nuanced understanding of their ability to solve diverse tasks and offering a universal measure of task complexity. Additionally, we introduce the Adaptive Hierarchical Prompt framework, which automates the selection of an appropriate prompting strategy for each task. This study compares the manual and adaptive hierarchical prompt frameworks using four instruction-tuned LLMs, namely Llama 3 8B, Phi 3 3.8B, Mistral 7B, and Gemma 7B, across four datasets: BoolQ, CommonSenseQA (CSQA), IWSLT-2017 en-fr (IWSLT), and SamSum. Experiments demonstrate the effectiveness of HPT, providing a reliable way to compare different tasks and LLM capabilities. This work lays the groundwork for a universal evaluation metric that can assess both the complexity of datasets and the capabilities of LLMs. The implementations of both the manual HPF and the adaptive HPF are publicly available.