Large Language Models (LLMs), such as ChatGPT, have achieved impressive milestones in natural language processing (NLP). Despite their strong performance, these models are known to pose important risks. As they are deployed in real-world applications, a systematic understanding of the different risks they pose on tasks such as natural language inference (NLI) is much needed. In this paper, we define and formalize two distinct types of risk: decision risk and composite risk. We also propose a risk-centric evaluation framework, along with four novel metrics, for assessing LLMs on these risks in both in-domain and out-of-domain settings. Finally, we propose a risk-adjusted calibration method called DwD that helps LLMs minimize these risks within an overall NLI architecture. Detailed experiments using four NLI benchmarks, three baselines, and two LLMs, including ChatGPT, demonstrate both the practical utility of the evaluation framework and the efficacy of DwD in reducing decision and composite risk. For instance, when using DwD, an underlying LLM is able to address an extra 20.1% of low-risk inference tasks that it erroneously deems high-risk without risk adjustment, and to skip a further 19.8% of high-risk tasks that it would otherwise have answered incorrectly.