While LLMs have shown great success in understanding and generating text in traditional conversational settings, their potential for performing ill-defined complex tasks remains largely under-studied. Indeed, comprehensive benchmarking studies that compare multiple LLMs and focus exclusively on a complex task have yet to be conducted. Such studies are challenging, however, because LLM performance varies widely with the type and style of prompt used and with the degree of detail the prompt provides. To address this issue, the paper proposes a general taxonomy that can be used to design prompts with specific properties in order to perform a wide range of complex tasks. This taxonomy will allow future benchmarking studies to report the specific categories of prompts they used, enabling meaningful comparisons across studies. By establishing a common standard, the taxonomy will also allow researchers to draw more accurate conclusions about LLMs' performance on a specific complex task.