With the rapid growth of social media usage, screening social media content for hate speech has become increasingly important. For the past decade, researchers have worked on distinguishing content that promotes hatred from content that does not. Traditionally, the focus has been on textual content, though recent work has also begun to address audio-based content. However, studies have shown that relying on audio or text alone may be ineffective, since people often use sarcasm in both speech and writing. To address this, we present an approach that uses both audio and textual representations to identify whether speech promotes hate. Our methodology is based on the Transformer framework, incorporates both audio and text sampling, and adds a new layer that we call "Attentive Fusion". Our approach surpasses previous state-of-the-art techniques, achieving a macro F1 score of 0.927 on the test set.
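For readers unfamiliar with the reported metric, macro F1 is the unweighted mean of the per-class F1 scores, so the minority class counts as much as the majority class. The following is a minimal sketch of that computation. The function name `macro_f1`, the labels `"hate"` and `"not_hate"`, and the toy predictions are illustrative assumptions; they are not the evaluation code or data behind the 0.927 figure.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Return the unweighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    # Every label seen in either the references or the predictions.
    labels = set(y_true) | set(y_pred)
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example with hypothetical labels (not the paper's data).
y_true = ["hate", "hate", "hate",
          "not_hate", "not_hate", "not_hate", "not_hate", "not_hate"]
y_pred = ["hate", "hate", "not_hate",
          "not_hate", "not_hate", "not_hate", "not_hate", "hate"]
score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.3f}")
```

In this toy case the "hate" class has F1 = 2/3 and the "not_hate" class has F1 = 0.8, so the macro average is about 0.733 even though 6 of 8 predictions are correct. This is why macro F1 is commonly preferred over accuracy when classes are imbalanced.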