The deployment of Large Language Models (LLMs) in real-world applications presents both opportunities and challenges, particularly in multilingual and code-mixed communication settings. This research evaluates the performance of seven leading LLMs on sentiment analysis over a dataset derived from multilingual and code-mixed WhatsApp chats spanning Swahili, English, and Sheng. Our evaluation combines quantitative analysis using metrics such as the F1 score with qualitative assessment of the LLMs' explanations for their predictions. We find that, while Mistral-7b and Mixtral-8x7b achieved high F1 scores, they and other LLMs such as GPT-3.5-Turbo, Llama-2-70b, and Gemma-7b struggled to capture linguistic and contextual nuances and, as their explanations revealed, lacked transparency in their decision-making. In contrast, GPT-4 and GPT-4-Turbo excelled at handling diverse linguistic inputs and managing varied contextual information, demonstrating strong alignment with human judgments and transparent decision-making. All of the LLMs, however, had difficulty incorporating cultural nuance, especially in non-English settings, with even GPT-4 and GPT-4-Turbo doing so only inconsistently. These findings underscore the need for continued improvement of LLMs to effectively handle culturally nuanced, low-resource, real-world settings, and for evaluation benchmarks that capture these issues.
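For concreteness, the macro-averaged F1 score used in such multi-class sentiment evaluations can be computed as the unweighted mean of per-class F1 scores. Below is a minimal self-contained sketch; the three-way positive/negative/neutral label scheme and the toy examples are illustrative assumptions, not the paper's actual dataset or label set.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores (macro-averaged F1)."""
    f1s = []
    for c in labels:
        # Per-class counts: true positives, false positives, false negatives.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        f1s.append(f1)
    return sum(f1s) / len(f1s)

# Hypothetical gold labels and model predictions for four messages.
gold = ["positive", "negative", "neutral", "positive"]
pred = ["positive", "neutral", "neutral", "positive"]
print(macro_f1(gold, pred, ["positive", "negative", "neutral"]))
```

Macro averaging weights each sentiment class equally regardless of class frequency, which is a common choice when minority classes (e.g. a rare "negative" class in chat data) matter as much as frequent ones.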