Machine learning models are increasingly used for software security tasks. These models are commonly trained and evaluated on large Internet-derived datasets, which often contain duplicated or highly similar samples. When such samples are split across training and test sets, data leakage may occur, allowing models to score well by memorizing patterns rather than learning to generalize. We investigate duplication in a widely used benchmark dataset of hard-coded secrets and show how data leakage can substantially inflate the reported performance of AI-based secret detectors, resulting in a misleading picture of their real-world effectiveness.
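To make the failure mode concrete, the following sketch (with hypothetical samples and helper names, not the paper's actual pipeline or benchmark) shows how whitespace-insensitive fingerprinting can surface duplicated samples that straddle a train/test split:

```python
# Minimal sketch: measuring train/test overlap caused by duplicated samples.
# The data and helper functions here are illustrative assumptions only.
import hashlib

def normalize(code: str) -> str:
    # Collapse whitespace so trivially reformatted copies hash identically.
    return " ".join(code.split())

def fingerprint(code: str) -> str:
    return hashlib.sha256(normalize(code).encode("utf-8")).hexdigest()

# Hypothetical samples; real benchmarks hold thousands of code snippets.
train = ['api_key = "AKIA123"', 'token = "ghp_abc"']
test = [
    'api_key  =  "AKIA123"',   # whitespace-only variant of a training sample
    'password = "hunter2"',
]

train_prints = {fingerprint(s) for s in train}
leaked = [s for s in test if fingerprint(s) in train_prints]

print(f"{len(leaked)}/{len(test)} test samples duplicate a training sample")
# A detector evaluated on such leaked samples can score high by memorization alone.
```

A stricter audit would also catch near-duplicates that differ by more than whitespace, for example via token-level similarity or MinHash, but even this exact-match check illustrates why overlapping splits inflate reported metrics.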