The rapid growth of multimodal content on the internet, video in particular, has broadened online communication. Detecting toxic content in this setting remains a critical challenge, particularly for low-resource code-mixed languages. While substantial research has addressed toxicity detection in text, video content has been relatively underexplored, especially in non-English languages. This paper addresses that gap by introducing a benchmark dataset, the first of its kind, consisting of 931 videos with 4021 code-mixed Hindi-English utterances collected from YouTube. Each utterance is annotated with toxicity, severity, and sentiment labels. We develop ToxVidLLM, a multimodal multitask framework for toxicity detection in video content that leverages Large Language Models (LLMs); toxicity detection is the primary task, with sentiment and severity analysis as additional tasks. ToxVidLLM incorporates three key modules: the Encoder module, the Cross-Modal Synchronization module, and the Multitask module, which together form a generic multimodal LLM customized for intricate video classification tasks. Our experiments show that incorporating multiple modalities from the videos substantially improves toxic content detection, achieving an Accuracy of 94.29% and a Weighted F1 score of 94.35%.
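For readers unfamiliar with the two reported metrics, the sketch below shows how Accuracy and a support-weighted F1 score are commonly computed for a classification task. It assumes that "Weighted F1" means the per-class F1 averaged with weights proportional to each class's number of gold examples; the abstract does not spell this out. The function names and the toy labels are hypothetical and are not taken from the paper or its dataset.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of utterances whose predicted label equals the gold label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's gold support.

    Assumption: this is the usual 'weighted' averaging; the paper may
    compute its Weighted F1 differently.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    support = Counter(y_true)  # number of gold examples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / n  # n = tp + fn, and n > 0 by construction
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n / total
    return score


# Toy labels, invented purely for illustration (not from the dataset).
gold = ["toxic", "toxic", "non-toxic", "non-toxic", "non-toxic"]
pred = ["toxic", "non-toxic", "non-toxic", "non-toxic", "toxic"]

print(f"accuracy    = {accuracy(gold, pred):.4f}")
print(f"weighted F1 = {weighted_f1(gold, pred):.4f}")
```

Weighting by support matters when classes are imbalanced, as is often the case for toxic versus non-toxic content: each class contributes in proportion to how often it occurs in the gold labels, so the score can differ from both Accuracy and an unweighted macro average.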