According to GitGuardian's monitoring of public GitHub repositories, the number of exposed secrets (API keys and other credentials) doubled in 2021 compared to 2020, totaling more than six million secrets. However, secret detection tools produce many false-positive warnings, and no benchmark dataset is publicly available for researchers and tool developers to evaluate them. The goal of our paper is to aid researchers and tool developers in evaluating and improving secret detection tools by curating a benchmark dataset of secrets through a systematic collection of secrets from open-source repositories. We present a labeled dataset of source code containing 97,479 secrets (of which 15,084 are true secrets) of various secret types, extracted from 818 public GitHub repositories. The dataset covers 49 programming languages and 311 file types.
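As an illustration of how a labeled benchmark of this kind can be used, the following minimal sketch computes the precision and recall of a detection tool against ground-truth labels. The `Candidate` schema, the `evaluate` helper, and the toy data are hypothetical and chosen for illustration only; they do not reflect the dataset's actual format. Only the two counts in the final lines come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One labeled candidate secret (hypothetical schema, not the dataset's)."""
    candidate_id: str
    is_true_secret: bool


def evaluate(candidates, flagged_ids):
    """Return (precision, recall) of a tool's flagged ids against the labels."""
    flagged = set(flagged_ids)
    tp = sum(1 for c in candidates if c.is_true_secret and c.candidate_id in flagged)
    fp = sum(1 for c in candidates if not c.is_true_secret and c.candidate_id in flagged)
    fn = sum(1 for c in candidates if c.is_true_secret and c.candidate_id not in flagged)
    # Guard against division by zero when nothing is flagged or nothing is true.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data for illustration only; not taken from the dataset.
toy = [
    Candidate("a", True),
    Candidate("b", False),
    Candidate("c", True),
    Candidate("d", False),
]
precision, recall = evaluate(toy, flagged_ids=["a", "b", "d"])
print(f"precision={precision:.2f} recall={recall:.2f}")

# Share of true secrets among all labeled secrets, from the counts in the abstract.
true_share = 15_084 / 97_479
print(f"true-secret share: {true_share:.1%}")
```

In this sketch, a tool that flags many non-secrets scores low precision, which is the false-positive problem the benchmark is meant to quantify.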