Pushing the Limits of ChatGPT on NLP Tasks

Despite the success of ChatGPT, its performance on most NLP tasks remains well below that of supervised baselines. In this work, we investigate the causes and find that the subpar performance stems from the following factors: (1) the token limit of the prompt prevents full use of the supervised datasets; (2) there is a mismatch between the generative nature of ChatGPT and the format of NLP tasks; (3) LLMs have intrinsic pitfalls, e.g., hallucination and excessive focus on certain keywords. We propose a collection of general modules to address these issues, in an attempt to push the limits of ChatGPT on NLP tasks. The proposed modules include: (1) a one-input-multiple-prompts strategy that employs multiple prompts for one input to accommodate more demonstrations; (2) fine-tuned models for better demonstration retrieval; (3) transforming tasks into formats better suited to generation; (4) reasoning strategies tailored to task-specific complexity; (5) a self-verification strategy to address the hallucination issue of LLMs; (6) a paraphrase strategy to improve the robustness of model predictions. We conduct experiments on 21 datasets covering 10 representative NLP tasks: question answering, commonsense reasoning, natural language inference, sentiment analysis, named entity recognition, entity-relation extraction, event extraction, dependency parsing, semantic role labeling, and part-of-speech tagging. Using the proposed ensemble of techniques, we significantly boost the performance of ChatGPT on the selected NLP tasks, achieving performance comparable to or better than supervised baselines, and in some cases existing SOTA results.
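As a rough illustration of module (1), the one-input-multiple-prompts strategy can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say how demonstrations are distributed across prompts or how the per-prompt answers are combined, so the consecutive chunking, the `Input:`/`Label:` prompt template, and the majority vote are all assumptions. The function names and the `complete` callable (standing in for a call to the language model) are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Demo = Tuple[str, str]  # (input text, gold label)


def chunk_demonstrations(demos: Sequence[Demo], per_prompt: int) -> List[List[Demo]]:
    """Split demonstrations into consecutive groups of at most `per_prompt` items."""
    if per_prompt <= 0:
        raise ValueError("per_prompt must be positive")
    return [list(demos[i:i + per_prompt]) for i in range(0, len(demos), per_prompt)]


def build_prompt(demos: Sequence[Demo], query: str) -> str:
    """Format one few-shot prompt (the template itself is an assumption)."""
    blocks = [f"Input: {text}\nLabel: {label}" for text, label in demos]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)


def predict_with_multiple_prompts(
    query: str,
    demos: Sequence[Demo],
    per_prompt: int,
    complete: Callable[[str], str],  # hypothetical stand-in for the LLM call
) -> str:
    """Query the model once per demonstration group and majority-vote the answers."""
    # With no demonstrations, fall back to a single zero-shot prompt.
    groups = chunk_demonstrations(demos, per_prompt) or [[]]
    prompts = [build_prompt(group, query) for group in groups]
    votes = Counter(complete(prompt).strip() for prompt in prompts)
    # Ties are broken in favour of the answer that was seen first.
    return votes.most_common(1)[0][0]
```

Because each prompt holds only `per_prompt` demonstrations, the total number of demonstrations used for one input is no longer bounded by the token limit of a single prompt, only by the number of model calls one is willing to make.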
