The widespread use of social media platforms such as Twitter and Facebook has enabled people of all ages to share their thoughts and experiences, producing an immense volume of user-generated content. Alongside these benefits, the platforms face the challenge of managing hate speech and offensive content, which can undermine rational discourse and threaten democratic values. There is therefore a growing need for automated methods to detect and mitigate such content, especially because conversations may require contextual analysis across multiple languages, including code-mixed languages such as Hinglish, German-English, and Bangla. We participated in the English task, in which English tweets must be classified into two categories: Hate and Offensive, and Non Hate-Offensive. We experiment with prompting state-of-the-art large language models such as GPT-3.5 Turbo to perform this classification. We evaluate the classifier over three runs, using as the primary metric the Macro-F1 score, which is the unweighted mean of the per-class F1 scores and therefore balances precision and recall across all classes. The scores are 0.756 for run 1, 0.751 for run 2, and 0.754 for run 3, a spread of only 0.005, with run 1 scoring highest. These results indicate that the approach performs consistently across runs in terms of both precision and recall.
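To make the evaluation metric concrete, the following is a minimal sketch of how a Macro-F1 score can be computed for a two-class task and how the three reported run scores can be summarized. The label names `HOF` and `NOT`, the function name `macro_f1`, and the toy gold and predicted labels are assumptions introduced purely for illustration; they are not the task's data or the authors' evaluation code. Only the three run scores come from the abstract.

```python
from statistics import mean

# Hypothetical label names for the two classes (an assumption, not the task's official tags).
LABELS = ("HOF", "NOT")


def macro_f1(y_true, y_pred, labels=LABELS):
    """Return the unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example with made-up labels, only to show the calculation.
toy_true = ["HOF", "HOF", "NOT", "NOT", "NOT", "HOF"]
toy_pred = ["HOF", "NOT", "NOT", "NOT", "HOF", "HOF"]
toy_score = macro_f1(toy_true, toy_pred)

# Run scores as reported in the abstract.
run_scores = {"run 1": 0.756, "run 2": 0.751, "run 3": 0.754}
mean_score = mean(run_scores.values())
spread = max(run_scores.values()) - min(run_scores.values())
best_run = max(run_scores, key=run_scores.get)

print(f"toy Macro-F1: {toy_score:.3f}")
print(f"mean over runs: {mean_score:.3f}, spread: {spread:.3f}, best: {best_run}")
```

Because every class contributes equally to the average, Macro-F1 does not let a majority class dominate the score, which is why it is a common choice when the two classes are imbalanced.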