Online communities have increasingly been inundated by a wave of toxic, harmful comments. In response to this growing challenge, we introduce a two-stage, ultra-low-cost multimodal harmful-behavior detection method designed to identify harmful comments and images with high precision and recall. In the first stage, we use the CLIP-ViT model to transform tweets and images into embeddings, capturing the interplay of semantic meaning and subtle contextual cues in both text and images. In the second stage, the system feeds these embeddings into a conventional machine learning classifier such as an SVM or logistic regression, enabling rapid training and ultra-low-cost inference. By converting tweets into rich multimodal embeddings with CLIP-ViT and using them to train conventional classifiers, our system not only detects harmful textual content with near-perfect performance, achieving precision and recall above 99\%, but also identifies harmful images in a zero-shot manner without additional training, thanks to its multimodal embedding input. This capability allows the system to flag unseen harmful images without requiring extensive and costly image datasets. The system also adapts quickly to new harmful content: when a new harmful-content pattern is identified, we can fine-tune the classifier on the embeddings of the corresponding tweets to promptly update the system. This makes it well suited to the ever-evolving nature of online harm, providing online communities with a robust, generalizable, and cost-effective tool to safeguard their members.
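The two-stage pipeline above can be sketched as follows. This is a minimal, hedged illustration: the CLIP-ViT embedding stage is simulated with random class-separated vectors so the sketch runs standalone (in practice the embeddings would come from a pretrained CLIP-ViT model, e.g. via the Hugging Face `transformers` library), and the second-stage classifier is a hand-rolled NumPy logistic regression standing in for a scikit-learn SVM or `LogisticRegression`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 (simulated): in the real system, tweets and images are encoded
# by CLIP-ViT into a shared multimodal embedding space. Here we stand in
# random 512-dim vectors whose class means differ, so the sketch runs
# without downloading model weights.
dim, n = 512, 400
harmful = rng.normal(loc=0.3, scale=1.0, size=(n // 2, dim))
benign = rng.normal(loc=-0.3, scale=1.0, size=(n // 2, dim))
X = np.vstack([harmful, benign])
y = np.concatenate([np.ones(n // 2), np.zeros(n // 2)])

# Stage 2: a lightweight logistic-regression classifier on the embeddings,
# trained by plain gradient descent on the log-loss. This is cheap to fit
# and cheap at inference time, which is the point of the two-stage design.
w, b, lr = np.zeros(dim), 0.0, 0.1
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid probabilities
    w -= lr * (X.T @ (p - y)) / n           # gradient of mean log-loss
    b -= lr * (p - y).mean()

pred = (1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5
accuracy = (pred == y).mean()
print(f"train accuracy: {accuracy:.2f}")
```

Updating the system for a newly identified harmful-content pattern amounts to re-running only Stage 2 on the new embeddings, which is what makes the adaptation loop described above fast and inexpensive.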