Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks. However, their practical application in high-stakes domains, such as fraud and abuse detection, remains an area that requires further exploration. Existing applications often focus narrowly on specific tasks such as toxicity or hate speech detection. In this paper, we present a comprehensive benchmark suite designed to assess the performance of LLMs in identifying and mitigating fraudulent and abusive language across various real-world scenarios. Our benchmark encompasses a diverse set of tasks, including detecting spam emails, hate speech, misogynistic language, and more. We evaluated several state-of-the-art LLMs, including models from Anthropic, Mistral AI, and the AI21 family, to provide a comprehensive assessment of their capabilities in this critical domain. The results indicate that while LLMs exhibit proficient baseline performance on individual fraud and abuse detection tasks, their performance varies considerably across tasks, and they particularly struggle with tasks that demand nuanced pragmatic reasoning, such as identifying diverse forms of misogynistic language. These findings have important implications for the responsible development and deployment of LLMs in high-risk applications. Our benchmark suite can serve as a tool for researchers and practitioners to systematically evaluate LLMs for multi-task fraud detection and to drive the creation of more robust, trustworthy, and ethically aligned systems for fraud and abuse detection.