Credential leakage in public source code repositories poses a critical security threat, with over 23.8 million secrets exposed in 2024 alone. Existing detection tools suffer from high false-positive rates because rigid pattern matching and binary classification schemes cannot distinguish genuine credentials from placeholder or weak ones. We propose a three-class classification framework that explicitly models placeholder or weak credentials as a distinct class, combining CodeBERT-based semantic understanding with character-level pattern recognition. We evaluate the approach on a newly constructed dataset of 9,426 samples spanning 10 programming languages. Our model achieves a Matthews Correlation Coefficient of 0.86 and a macro F1-score of 0.90, with 93% recall and 89% precision for genuine credential leaks, while reducing high-severity alerts by 33.0% (from 373 to 250) without sacrificing security coverage. Compared to prior character-level approaches, our method improves the F1-score for placeholder or weak credential detection from 54% to 81% and generalizes well across languages: 9 of 10 languages achieve an F1-score above 0.80 under leave-one-language-out evaluation.
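For readers who want to check how the reported metrics are defined, the following is a minimal sketch of macro F1 and the multiclass Matthews Correlation Coefficient for a three-class setting, plus the arithmetic behind the 33.0% alert reduction. It is not the evaluation code used in this work: the class names (`genuine`, `placeholder_or_weak`, `other`), the function names, and the toy labels are all assumptions chosen for illustration.

```python
from collections import Counter
from math import sqrt

# Hypothetical class names; the abstract only names the first two classes.
LABELS = ("genuine", "placeholder_or_weak", "other")


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def multiclass_mcc(y_true, y_pred):
    """Multiclass generalization of MCC computed from label counts."""
    s = len(y_true)                                   # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correctly classified
    t_k = Counter(y_true)                             # true count per class
    p_k = Counter(y_pred)                             # predicted count per class
    classes = set(t_k) | set(p_k)
    numerator = c * s - sum(t_k[k] * p_k[k] for k in classes)
    denominator = sqrt(
        (s * s - sum(v * v for v in p_k.values()))
        * (s * s - sum(v * v for v in t_k.values()))
    )
    return numerator / denominator if denominator else 0.0


def relative_reduction(before, after):
    """Fraction by which a count dropped, e.g. high-severity alerts."""
    return (before - after) / before


if __name__ == "__main__":
    # Toy labels, invented for illustration only.
    y_true = ["genuine", "genuine", "placeholder_or_weak",
              "placeholder_or_weak", "other", "other"]
    y_pred = ["genuine", "placeholder_or_weak", "placeholder_or_weak",
              "placeholder_or_weak", "other", "other"]
    print(f"macro F1: {macro_f1(y_true, y_pred):.3f}")
    print(f"MCC:      {multiclass_mcc(y_true, y_pred):.3f}")
    # Alert counts taken from the abstract: 373 before, 250 after.
    print(f"alert reduction: {relative_reduction(373, 250):.1%}")
```

Macro averaging weights each class equally, so a weak score on the placeholder or weak class lowers the overall figure even when that class is small. MCC additionally accounts for the full distribution of true and predicted labels, which is presumably why both metrics are reported together.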