Explainable Tampered Text Detection via Multimodal Large Models

Tampered text detection has attracted increasing attention because of its essential role in information security. Although existing methods can detect tampered text regions, they do not explain the basis for a detection, which makes their predictions hard to trust. To address this black-box problem, we propose explaining the basis of tampered text detection in natural language via large multimodal models. To fill the data gap for this task, we introduce a large-scale, comprehensive dataset, ETTD, which contains both pixel-level annotations indicating the tampered text region and natural language annotations describing the anomalies of the tampered text. Several methods are employed to improve the quality of the data. For example, a fused mask prompt is proposed to reduce confusion when querying GPT4o to generate anomaly descriptions: by weighting the input image with the mask annotation, the tampered region is clearly indicated while the content in and around it is preserved. We also prompt GPT4o to recognize the tampered text and filter out responses with low OCR accuracy, which improves annotation quality automatically. To further improve explainable tampered text detection, we propose a simple yet effective model called TTD, which gains fine-grained perception by attending to the suspected region through an auxiliary reference grounding query. Extensive experiments on both the ETTD dataset and the public dataset verify the effectiveness of the proposed methods. In-depth analysis is also provided to inspire further research. The dataset and code will be made publicly available.
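The two data-quality steps described in the abstract, the fused mask prompt and OCR-based filtering, can be illustrated with a minimal sketch. The abstract does not give the exact weighting formula, the OCR metric, or any threshold, so everything below is an assumption made for illustration: the function names, the `background_weight` and `threshold` parameters, dimming of pixels outside the mask as the form of "weighting", and character-level similarity as a stand-in for OCR accuracy. It is not the authors' implementation.

```python
from difflib import SequenceMatcher


def fuse_mask_prompt(image, mask, background_weight=0.5):
    """Dim pixels outside the tampered region so the region stands out.

    image: grayscale image as rows of 0-255 ints (a real pipeline would
           likely use an array library and RGB input).
    mask: same shape as image; nonzero marks tampered pixels.
    background_weight: hypothetical factor in [0, 1] applied outside the
           mask. Content around the region stays visible, only darker.
    """
    return [
        [pixel if flag else round(pixel * background_weight)
         for pixel, flag in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


def ocr_accuracy(recognized_text, reference_text):
    """Character-level similarity in [0, 1]; a stand-in for OCR accuracy."""
    return SequenceMatcher(None, recognized_text, reference_text).ratio()


def keep_response(recognized_text, reference_text, threshold=0.8):
    """Keep a generated annotation only if the model read the text well.

    threshold is a hypothetical cutoff, not a value from the paper.
    """
    return ocr_accuracy(recognized_text, reference_text) >= threshold


if __name__ == "__main__":
    image = [[200, 200], [200, 200]]
    mask = [[1, 0], [0, 0]]
    print(fuse_mask_prompt(image, mask))
    print(keep_response("TOTAL 120", "TOTAL 120"))
```

The intuition behind the filter is that a response which misreads the tampered text probably did not attend to the right region, so its anomaly description is less likely to be reliable.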

