Assessing the effectiveness of large language models (LLMs) across diverse tasks is essential for understanding their strengths and weaknesses. Conventional evaluation techniques typically apply a single prompting strategy uniformly across a dataset, without accounting for varying degrees of task complexity. We introduce the Hierarchical Prompting Taxonomy (HPT), a taxonomy that employs a Hierarchical Prompt Framework (HPF) composed of five distinct prompting strategies, arranged from simplest to most complex, to assess LLMs more precisely and to offer a clearer perspective on their behavior. Based on its rules, the taxonomy assigns a score, called the Hierarchical Prompting Score (HP-Score), to both datasets and LLMs, yielding a nuanced understanding of their ability to solve diverse tasks and a universal measure of task complexity. Additionally, we introduce the Adaptive Hierarchical Prompt framework, which automates the selection of an appropriate prompting strategy for each task. This study compares the manual and adaptive hierarchical prompt frameworks using four instruction-tuned LLMs, namely Llama 3 8B, Phi 3 3.8B, Mistral 7B, and Gemma 7B, across four datasets: BoolQ, CommonSenseQA (CSQA), IWSLT-2017 en-fr (IWSLT), and SamSum. Experiments demonstrate the effectiveness of HPT, providing a reliable way to compare different tasks and LLM capabilities. This work thus moves toward a universal evaluation metric that measures both the complexity of datasets and the capabilities of LLMs. The implementations of both the manual HPF and the adaptive HPF are publicly available.
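The manual HPF described above can be sketched as a loop that tries progressively more complex prompting strategies and records the level at which the task is first solved. This is a minimal illustrative sketch, not the paper's exact procedure: the five strategy names, the failure penalty, and the helpers `query_llm` and `is_correct` are all assumptions standing in for a real model call and answer checker.

```python
# Illustrative sketch of a hierarchical prompting loop.
# Strategy names and scoring rule are assumptions, not the paper's
# exact definitions.

HPF_LEVELS = [
    "level_1_simplest",
    "level_2",
    "level_3",
    "level_4",
    "level_5_most_complex",
]

def query_llm(strategy: str, task: str) -> str:
    # Hypothetical model call; this stub fails until the most
    # complex strategy is reached, purely for demonstration.
    return "correct" if strategy == HPF_LEVELS[-1] else "wrong"

def is_correct(answer: str, reference: str) -> bool:
    # Hypothetical answer checker (exact match here).
    return answer == reference

def hp_score(task: str, reference: str, penalty: int = 10) -> int:
    """Return the level (1-5) at which the task is first solved,
    or an assumed penalty score if no strategy succeeds."""
    for level, strategy in enumerate(HPF_LEVELS, start=1):
        if is_correct(query_llm(strategy, task), reference):
            return level
    return penalty

print(hp_score("example task", "correct"))  # → 5 with the stub above
```

A lower score indicates a task solvable with a simpler prompt; averaging such scores over a dataset gives a rough complexity measure in the spirit of the HP-Score, while an adaptive variant would replace the fixed bottom-up sweep with a learned strategy selector.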