In the digital era, accidental exposure of sensitive information such as API keys, tokens, and credentials is a growing security threat. While most prior work focuses on detecting secrets in source code, leakage in software issue reports remains largely unexplored. This study fills that gap through a large-scale analysis and a practical detection pipeline for exposed secrets in GitHub issues. Our pipeline combines regular expression-based extraction with large language model (LLM)-based contextual classification to detect real secrets and reduce false positives. We build a benchmark of 54,148 instances from public GitHub issues, including 5,881 manually verified true secrets. Using this dataset, we evaluate entropy-based baselines and keyword heuristics used by prior secret detection tools, classical machine learning, deep learning, and LLM-based methods. Regex- and entropy-based approaches achieve high recall but poor precision, while smaller models such as RoBERTa and CodeBERT greatly improve performance (F1 = 92.70%). Proprietary models like GPT-4o perform moderately in few-shot settings (F1 = 80.13%), and fine-tuned larger open-source LLMs such as Qwen and LLaMA reach up to 94.49% F1. Finally, we validate our approach on 178 real-world GitHub repositories, achieving an F1-score of 81.6%, which demonstrates that our approach generalizes well to in-the-wild scenarios.
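The regex-plus-entropy stage of such a pipeline can be illustrated with a minimal sketch. The candidate pattern, entropy threshold, and example string below are illustrative assumptions, not the actual patterns used in this study; real detectors also employ provider-specific regexes (e.g., for AWS or GitHub token formats):

```python
import math
import re

# Hypothetical candidate pattern: long alphanumeric-like tokens that may be secrets.
CANDIDATE_RE = re.compile(r"[A-Za-z0-9_\-]{20,}")

def shannon_entropy(s: str) -> float:
    """Shannon entropy in bits per character of s."""
    if not s:
        return 0.0
    n = len(s)
    counts = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def find_candidates(text: str, threshold: float = 4.0):
    """Return regex matches whose entropy exceeds the threshold (assumed value)."""
    return [m.group(0) for m in CANDIDATE_RE.finditer(text)
            if shannon_entropy(m.group(0)) >= threshold]

# Illustrative issue body with a fabricated, random-looking key.
issue_body = "Set API_KEY=aK9f3Qz7Lm2Xp8Rv1Tb6Yw4Nc0Jd5Hg and restart the service."
print(find_candidates(issue_body))
```

This stage yields high recall but low precision, since any high-entropy token (e.g., a commit hash or minified identifier) passes the filter; the LLM-based contextual classification stage is what removes such false positives.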