In the era of large language models (LLMs), ``System I'' tasks~--~fast, unconscious, and intuitive tasks such as sentiment analysis and text classification~--~have been argued to be successfully solved. However, sarcasm, as a subtle linguistic phenomenon, often employs rhetorical devices such as hyperbole and figurative language to convey true sentiments and intentions, and thus involves a higher level of abstraction than sentiment analysis. There is growing concern that the claim of LLMs' success may not be fully tenable when sarcasm understanding is considered. To address this question, we select eleven state-of-the-art (SOTA) LLMs and eight SOTA pre-trained language models (PLMs) and present a comprehensive evaluation on six widely used benchmark datasets using three prompting approaches: zero-shot input/output (IO) prompting, few-shot IO prompting, and few-shot chain-of-thought (CoT) prompting. Our results highlight three key findings: (1) Current LLMs underperform supervised PLM-based sarcasm detection baselines across all six benchmarks, suggesting that significant effort is still required to improve LLMs' understanding of human sarcasm. (2) GPT-4 consistently and significantly outperforms the other LLMs across the various prompting methods, with an average improvement of 14.0\%$\uparrow$; Claude 3 and ChatGPT achieve the next best performance after GPT-4. (3) Few-shot IO prompting outperforms the other two methods, zero-shot IO and few-shot CoT. The likely reason is that sarcasm detection is a holistic, intuitive, and non-rational cognitive process that does not adhere to step-by-step logical reasoning, making CoT less effective for understanding sarcasm than for mathematical reasoning tasks.
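The three prompting settings can be illustrated with minimal templates. This is a sketch only: the exact instruction wording, label set, and exemplars below are assumptions for illustration, not the paper's actual prompts.

```python
# Illustrative templates for the three evaluation settings.
# All instruction text and exemplars here are hypothetical.

def zero_shot_io(text: str) -> str:
    """Zero-shot IO: the model answers directly, with no examples."""
    return (
        "Is the following text sarcastic? Answer Yes or No.\n"
        f"Text: {text}\nAnswer:"
    )

def few_shot_io(text: str, exemplars: list[tuple[str, str]]) -> str:
    """Few-shot IO: a handful of labeled (text, answer) examples
    precede the query; no reasoning is shown."""
    shots = "\n".join(f"Text: {t}\nAnswer: {y}" for t, y in exemplars)
    return (
        "Is the following text sarcastic? Answer Yes or No.\n"
        f"{shots}\nText: {text}\nAnswer:"
    )

def few_shot_cot(text: str, exemplars: list[tuple[str, str, str]]) -> str:
    """Few-shot CoT: each (text, rationale, answer) exemplar shows
    step-by-step reasoning before the label."""
    shots = "\n".join(
        f"Text: {t}\nReasoning: {r}\nAnswer: {y}" for t, r, y in exemplars
    )
    return (
        "Decide whether the following text is sarcastic.\n"
        f"{shots}\nText: {text}\nReasoning:"
    )
```

The structural difference is small but, per finding (3), consequential: the CoT template forces an explicit step-by-step rationale, which suits mathematical reasoning better than the holistic judgment that sarcasm detection appears to require.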