GPT-3.5 models have demonstrated impressive performance on a range of Natural Language Processing (NLP) tasks, showcasing strong understanding and reasoning capabilities. However, their robustness and their ability to handle the complexities of the open world have yet to be explored, even though both are crucial for assessing model stability and are a key aspect of trustworthy AI. In this study, we perform a comprehensive experimental analysis of GPT-3.5's robustness using 21 datasets (about 116K test samples) and 66 text transformations from TextFlint, covering 9 popular Natural Language Understanding (NLU) tasks. Our findings indicate that while GPT-3.5 outperforms existing fine-tuned models on some tasks, it still suffers significant robustness degradation: its average performance drops by up to 35.74% and 43.59% on natural language inference and sentiment analysis tasks, respectively. We also show that GPT-3.5 faces specific robustness challenges, including robustness instability, prompt sensitivity, and number sensitivity. These insights are valuable for understanding its limitations and for guiding future research on improving GPT-3.5's overall performance and generalization abilities.
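To make the reported performance drops concrete, here is a minimal sketch of how a drop between original and transformed test samples could be computed. The function names and the example scores are hypothetical, and the abstract does not state whether the reported figures are absolute or relative drops, so both variants are shown as assumptions rather than as the study's actual metric.

```python
def absolute_drop(original_score: float, transformed_score: float) -> float:
    """Drop in percentage points between original and transformed data."""
    return original_score - transformed_score


def relative_drop(original_score: float, transformed_score: float) -> float:
    """Drop as a percentage of the original score (assumed definition)."""
    if original_score == 0:
        raise ValueError("original_score must be non-zero")
    return (original_score - transformed_score) / original_score * 100.0


if __name__ == "__main__":
    # Hypothetical accuracies (in %) before and after a text transformation.
    original, transformed = 80.0, 60.0
    print(absolute_drop(original, transformed))
    print(relative_drop(original, transformed))
```

Reporting both values side by side helps avoid ambiguity, since the same pair of scores yields different numbers under the two definitions.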