Pushing the Limits of ChatGPT on NLP Tasks

Despite the success of ChatGPT, its performance on most NLP tasks is still well below that of supervised baselines. In this work, we investigate the causes and find that the subpar performance stems from the following factors: (1) the token limit of the prompt, which prevents full utilization of supervised datasets; (2) a mismatch between the generative nature of ChatGPT and NLP tasks; (3) intrinsic pitfalls of LLMs, e.g., hallucination and excessive focus on certain keywords. We propose a collection of general modules to address these issues, in an attempt to push the limits of ChatGPT on NLP tasks. Our proposed modules include: (1) a one-input-multiple-prompts strategy that employs multiple prompts for one input to accommodate more demonstrations; (2) fine-tuned models for better demonstration retrieval; (3) transforming tasks into formats better suited to the generative nature of the model; (4) reasoning strategies tailored to task-specific complexity; (5) a self-verification strategy to address the hallucination issue of LLMs; (6) a paraphrase strategy to improve the robustness of model predictions. We conduct experiments on 21 datasets covering 10 representative NLP tasks: question answering, commonsense reasoning, natural language inference, sentiment analysis, named entity recognition, entity-relation extraction, event extraction, dependency parsing, semantic role labeling, and part-of-speech tagging. Using the proposed ensemble of techniques, we significantly boost the performance of ChatGPT on the selected NLP tasks, achieving performance comparable to or better than supervised baselines, or even existing SOTA results.
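To make the one-input-multiple-prompts idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the demonstrations are split into fixed-size groups, each group forms its own prompt for the same input, and the per-prompt answers are combined by majority vote; the abstract does not specify the aggregation rule, so the voting step is an assumption. The names `chunk`, `build_prompt`, `predict_multi_prompt`, and `toy_llm` are hypothetical, and `toy_llm` is a deterministic stand-in for a real ChatGPT call.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Demo = Tuple[str, str]  # (input text, gold label)


def chunk(demos: Sequence[Demo], size: int) -> List[List[Demo]]:
    """Split the demonstrations into consecutive groups of at most `size`."""
    return [list(demos[i:i + size]) for i in range(0, len(demos), size)]


def build_prompt(demos: Sequence[Demo], query: str) -> str:
    """Format one few-shot prompt: the demonstrations, then the unanswered query."""
    blocks = [f"Input: {x}\nLabel: {y}" for x, y in demos]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)


def predict_multi_prompt(
    query: str,
    demos: Sequence[Demo],
    llm: Callable[[str], str],
    demos_per_prompt: int = 2,
) -> str:
    """Query the model once per demonstration group and majority-vote the answers.

    Majority voting is an assumption of this sketch, not a detail from the paper.
    """
    # With no demonstrations, fall back to a single zero-shot prompt.
    groups = chunk(demos, demos_per_prompt) or [[]]
    votes = [llm(build_prompt(group, query)).strip() for group in groups]
    return Counter(votes).most_common(1)[0][0]


def toy_llm(prompt: str) -> str:
    """Hypothetical stand-in for ChatGPT: echoes the majority label in the prompt."""
    positive = prompt.count("Label: positive")
    negative = prompt.count("Label: negative")
    return "positive" if positive >= negative else "negative"


demos = [
    ("good", "positive"), ("nice", "positive"),
    ("bad", "negative"), ("awful", "negative"),
    ("fine", "positive"), ("superb", "positive"),
]
prediction = predict_multi_prompt("great movie", demos, toy_llm, demos_per_prompt=2)
```

Here six demonstrations are spread over three prompts of two demonstrations each, so no single prompt has to fit all of them within the token limit. In practice `toy_llm` would be replaced by a function that sends the prompt to the model and returns its text reply.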
