An Expanded Synthetic Conversation Dataset for Multi-Turn Smishing Detection

Our prior work introduced COVA, a synthetically generated multi-turn conversational smishing dataset of 3,201 labeled conversations, and established baseline detection benchmarks across eight models. XGBoost with TF-IDF features achieved the best performance there, with 72.5% accuracy and 0.691 macro F1, while transformer models underperformed, which was attributed to input truncation and insufficient training data. We present COVA-X, an expanded dataset of 10,985 conversations spanning eight elder-targeted scam categories, produced by an improved generation pipeline that addresses the contamination, label mismatch, stage-direction bleed, and prompt-design failures of the first iteration. Retraining all classifiers on the expanded dataset yields the central finding of this work: Longformer now surpasses XGBoost on all evaluation metrics, achieving 79.71% accuracy and 0.7786 macro F1 compared with 78.43% and 0.7563 for XGBoost. This directly confirms that transformer models require larger conversational corpora to realize their contextual advantages. We additionally document a quality life cycle that includes a 12.7× reduction in the label correction rate, from 49.8% to 3.9%; an architectural intervention that reduces virtual-kidnapping artifact rates from 67.1% to 46.5%; and a per-scam-type outcome analysis showing that scam categories modulate results in mechanism-consistent ways. A pre/post-cleanup sensitivity analysis confirms that dataset refinement recovers genuine label-relevant signal across all three classifier architectures.
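The headline figures can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the function and variable names are hypothetical, and the inputs are the rounded values quoted in the abstract. Because of that rounding, the label-correction ratio computed here comes out slightly above the reported 12.7, which would be expected if the authors worked from unrounded rates (an assumption, not something the abstract states).

```python
# Cross-check of the figures quoted in the abstract.
# All names here are illustrative; inputs are the rounded published values.

def fold_reduction(before: float, after: float) -> float:
    """Return how many times smaller `after` is than `before`."""
    return before / after

def relative_drop(before: float, after: float) -> float:
    """Return the fractional decrease from `before` to `after`."""
    return (before - after) / before

# Label correction rate, in percent, first iteration vs. COVA-X.
label_fold = fold_reduction(49.8, 3.9)

# Virtual-kidnapping artifact rate, in percent, before vs. after the intervention.
artifact_drop = relative_drop(67.1, 46.5)

# Longformer minus XGBoost on COVA-X.
acc_gap = 79.71 - 78.43      # accuracy, percentage points
f1_gap = 0.7786 - 0.7563     # macro F1

print(f"label correction rate: about {label_fold:.2f}x lower")
print(f"artifact rate: about {artifact_drop:.1%} relative decrease")
print(f"Longformer lead: {acc_gap:.2f} accuracy points, {f1_gap:.4f} macro F1")
```

Read this way, the artifact intervention removes roughly three in ten of the affected conversations rather than eliminating the problem, and Longformer's lead over XGBoost is a little over one accuracy point.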

