Our prior work introduced COVA, a synthetically generated multi-turn conversational smishing dataset of 3,201 labeled conversations, and established baseline detection benchmarks across eight models. XGBoost with TF-IDF features achieved the best performance (72.5\% accuracy, 0.691 macro F1), while transformer models underperformed, which was attributed to input truncation and insufficient training data. We present COVA-X, an expanded dataset of 10,985 conversations spanning eight elder-targeted scam categories, produced by an improved generation pipeline that addresses the contamination, label-mismatch, stage-direction-bleed, and prompt-design failures of the first iteration. Retraining all classifiers on the expanded dataset yields the central finding of this work: Longformer now surpasses XGBoost on all evaluation metrics, achieving 79.71\% accuracy and 0.7786 macro F1 compared with 78.43\% and 0.7563 for XGBoost. This supports the earlier explanation that transformer models require larger conversational corpora to realize their contextual advantages. We additionally document a quality lifecycle that includes a 12.7$\times$ reduction in the label correction rate (from 49.8\% to 3.9\%), an architectural intervention that reduces the virtual-kidnapping artifact rate from 67.1\% to 46.5\%, and a per-scam-type outcome analysis showing that scam category modulates results in mechanism-consistent ways. A pre/post-cleanup sensitivity analysis confirms that dataset refinement recovers genuine label-relevant signal across all three classifier architectures.