In operational scenarios, steganographers use covers from sensors and processing pipelines that differ significantly from those used by researchers to train steganalysis models. The result is an unavoidable performance gap on out-of-distribution covers, commonly referred to as Cover Source Mismatch (CSM). In this study, we consider the scenario where all test images are processed with the same pipeline, but neither their labels nor the balance between cover and stego images is known. Our objective is to identify a training dataset that generalizes as well as possible to this target. By exploring a grid of processing pipelines that induce CSM, we identify a geometric metric, based on the chordal distance between subspaces spanned by DCTr features, that correlates strongly with operational regret and is unaffected by the cover-stego balance. Our contribution is a strategy for selecting or deriving customized training datasets that improve generalization for a given target. Experiments show that, under reasonable assumptions, this geometry-based optimization strategy outperforms traditional atomistic methods. Additional resources are available at github.com/RonyAbecidan/LeveragingGeometrytoMitigateCSM.
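As a rough illustration of the chordal distance mentioned above, the sketch below computes it between two subspaces using the standard definition, the square root of the summed squared sines of the principal angles. It is not the authors' implementation: the function names `subspace_basis` and `chordal_distance`, the centering step, the subspace dimension `k`, and the random stand-in features are all assumptions made for illustration, and real DCTr features would be extracted from images and have far more dimensions.

```python
import numpy as np


def subspace_basis(features, k):
    """Return an orthonormal basis (d x k) for the top-k principal
    directions of a feature matrix of shape (n_samples, d).

    Centering and the choice of k are illustrative assumptions.
    """
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal right singular vectors,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T


def chordal_distance(basis_a, basis_b):
    """Chordal distance between two subspaces given orthonormal bases.

    The singular values of basis_a.T @ basis_b are the cosines of the
    principal angles; the distance is sqrt(sum(sin^2(theta_i))).
    """
    cosines = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)  # guard against rounding error
    return float(np.sqrt(np.sum(1.0 - cosines ** 2)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical stand-ins for features of a candidate source and a target.
    source_features = rng.normal(size=(200, 32))
    target_features = rng.normal(size=(200, 32))
    d = chordal_distance(
        subspace_basis(source_features, k=5),
        subspace_basis(target_features, k=5),
    )
    print(f"chordal distance: {d:.3f}")
```

Under this definition the distance ranges from 0 for identical subspaces to sqrt(k) for mutually orthogonal ones, so a candidate training source with a smaller distance to the target would be the geometrically closer one.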