As realistic text tampering becomes more common, detecting forged text in images has become increasingly important for information security. However, professional text manipulation and annotation are costly, which limits the availability of real-world datasets; most existing datasets rely on synthetic tampering, which does not adequately replicate the attributes of real-world tampering. To address this issue, we present the Real Text Manipulation (RTM) dataset, comprising 14,250 text images: 5,986 manually tampered and 5,258 automatically tampered images, created using a variety of techniques, plus 3,006 unaltered text images for evaluating solution stability. Our evaluations indicate that existing methods perform poorly at text forgery detection on the RTM dataset. We propose a robust baseline solution featuring a Consistency-aware Aggregation Hub and a Gated Cross Neighborhood-attention Fusion module for efficient multi-modal information fusion, supplemented during training by a Tampered-Authentic Contrastive Learning module that sharpens the distinction between tampered and authentic feature representations. This framework, which can be extended to other dual-stream architectures, improves localization performance by 7.33% on manual manipulations and 6.38% on overall manipulations. Our contributions aim to advance real-world text tampering detection. Code and dataset will be made available at https://github.com/DrLuo/RTM
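The abstract names a Tampered-Authentic Contrastive Learning module but does not describe its formulation. As a rough illustration of the general idea only, the sketch below implements a generic supervised contrastive loss over labeled feature vectors: features with the same label (tampered or authentic) are pulled together and features with different labels are pushed apart. This is not the authors' module. The function names, the use of cosine similarity, and the `temperature` value are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Generic supervised contrastive loss (hypothetical stand-in, not the paper's module).

    features: list of feature vectors, e.g. one per sampled pixel or region.
    labels:   list of ints, 1 for tampered and 0 for authentic (assumed encoding).
    """
    n = len(features)
    total, anchors = 0.0, 0
    for i in range(n):
        # Positives share the anchor's label; every other sample is a negative.
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        logits = {
            j: cosine(features[i], features[j]) / temperature
            for j in range(n)
            if j != i
        }
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits.values())
        log_denom = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
        # Average negative log-probability of picking a positive for this anchor.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    # Labels that agree with the feature clusters give a lower loss
    # than labels that cut across them.
    print(supervised_contrastive_loss(feats, [1, 1, 0, 0]))
    print(supervised_contrastive_loss(feats, [1, 0, 1, 0]))
```

In a localization setting, a loss of this kind would typically be computed on features sampled from tampered and authentic regions and added to the main training objective. How the paper samples features and weights the term is not stated in the abstract.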