While LLMs have shown great success in understanding and generating text in traditional conversational settings, their potential for performing ill-defined complex tasks remains largely understudied. Indeed, comprehensive benchmarking studies that evaluate multiple LLMs and focus exclusively on a complex task have yet to be conducted. Conducting such studies is challenging, however, because LLM performance varies widely with the prompt type or style used and with the degree of detail the prompt provides. To address this issue, the paper proposes a general taxonomy that can be used to design prompts with specific properties in order to perform a wide range of complex tasks. This taxonomy will allow future benchmarking studies to report the specific categories of prompts they use, enabling meaningful comparisons across studies. In addition, by establishing a common standard, the taxonomy will allow researchers to draw more accurate conclusions about LLMs' performance on a specific complex task.