Tampered text detection has recently attracted increasing attention because of its essential role in information security. Although existing methods can detect the tampered text region, the basis for their detections remains unclear, which makes the predictions unreliable. To address this black-box problem, we propose to explain the basis of tampered text detection in natural language using large multimodal models. To fill the data gap for this task, we introduce ETTD, a large-scale, comprehensive dataset that contains both pixel-level annotations indicating the tampered text region and natural language annotations describing the anomaly of the tampered text. We employ several methods to improve the quality of the data. For example, we propose a fused mask prompt to reduce confusion when querying GPT4o to generate anomaly descriptions. Weighting the input image with the mask annotation clearly indicates the tampered region while preserving the content in and around it. We also propose prompting GPT4o to recognize the tampered text and filtering out responses with low OCR accuracy, which effectively improves annotation quality in an automatic manner. To further improve explainable tampered text detection, we propose a simple yet effective model called TTD, which gains improved fine-grained perception by attending to the suspected region through an auxiliary reference grounding query. Extensive experiments on both the ETTD dataset and a public dataset verify the effectiveness of the proposed methods. We also provide in-depth analysis to inspire further research. The dataset and code will be made publicly available.
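As an illustration of the fused mask prompt, the following is a minimal sketch of weighting an image with its mask annotation so that the tampered region is highlighted while the underlying pixels stay visible. The function name `fuse_mask_prompt`, the highlight colour, and the blending weight `alpha` are assumptions made for this example; the abstract does not specify the exact weighting used.

```python
import numpy as np


def fuse_mask_prompt(image, mask, color=(255, 0, 0), alpha=0.5):
    """Blend a highlight colour into the masked pixels of an RGB image.

    image: uint8 array of shape (H, W, 3)
    mask:  array of shape (H, W); nonzero values mark tampered pixels
    color: highlight colour (assumed, not taken from the paper)
    alpha: highlight weight inside the mask (assumed, not taken from the paper)
    """
    image = np.asarray(image, dtype=np.float32)
    # Per-pixel weight: alpha inside the mask, 0 outside; broadcast over channels.
    weight = (np.asarray(mask) > 0).astype(np.float32)[..., None] * alpha
    tint = np.asarray(color, dtype=np.float32)
    # Pixels outside the mask are unchanged; pixels inside are a weighted mix,
    # so the text in the tampered region remains readable.
    fused = (1.0 - weight) * image + weight * tint
    return np.clip(np.rint(fused), 0, 255).astype(np.uint8)


if __name__ == "__main__":
    img = np.full((2, 2, 3), 100, dtype=np.uint8)
    msk = np.array([[1, 0], [0, 0]])
    print(fuse_mask_prompt(img, msk, color=(200, 0, 0)))
```

With `alpha` below 1, the highlighted pixels keep part of their original intensity, which is one simple way to indicate the region without hiding its content.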
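The OCR-based filtering step can be sketched in a similar spirit. The example below assumes that each response already carries the text the model recognized in the tampered region and a reference transcription to compare against; the field names `recognized_text` and `reference_text`, the edit-distance-based accuracy measure, and the threshold of 0.8 are all hypothetical choices for illustration, not details given in the abstract.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if characters match
            ))
        prev = curr
    return prev[-1]


def ocr_accuracy(predicted: str, reference: str) -> float:
    """Character-level accuracy in [0, 1], defined here as 1 - normalized edit distance."""
    if not predicted and not reference:
        return 1.0
    longest = max(len(predicted), len(reference))
    return 1.0 - edit_distance(predicted, reference) / longest


def filter_responses(samples, threshold=0.8):
    """Keep only the samples whose recognized text is close enough to the reference."""
    return [
        s for s in samples
        if ocr_accuracy(s["recognized_text"], s["reference_text"]) >= threshold
    ]


if __name__ == "__main__":
    samples = [
        {"recognized_text": "TOTAL 42.00", "reference_text": "TOTAL 42.00"},
        {"recognized_text": "T0TAL", "reference_text": "INVOICE"},
    ]
    print(len(filter_responses(samples)))
```

The intuition is that a response which misreads the tampered text is unlikely to describe its anomalies reliably, so discarding it removes low-quality annotations without manual review.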