Large language models (LLMs) offer unprecedented text completion capabilities. As general models, they can fulfill a wide range of roles, including those of more specialized models. We assess the performance of GPT-4 and GPT-3.5 in zero-shot, few-shot, and fine-tuned settings on the aspect-based sentiment analysis (ABSA) task. Fine-tuned GPT-3.5 achieves a state-of-the-art F1 score of 83.8 on the joint aspect term extraction and polarity classification task of SemEval-2014 Task 4, improving upon InstructABSA [@scaria_instructabsa_2023] by 5.7%. However, this comes at the price of 1000 times more model parameters and thus increased inference cost. We discuss the cost-performance trade-offs of different models and analyze the typical errors that they make. Our results also indicate that detailed prompts improve performance in zero-shot and few-shot settings but are not necessary for fine-tuned models. This evidence is relevant for practitioners who face the choice between prompt engineering and fine-tuning when using LLMs for ABSA.