Code agents and empirical software engineering rely on public code datasets, yet these datasets lack verifiable quality guarantees. Static 'dataset cards' are informative, but they are neither auditable nor backed by statistical guarantees, which makes dataset quality difficult to attest. Teams build isolated, ad-hoc cleaning pipelines, which fragments effort and raises cost. We present SIEVE, a community-driven framework that turns per-property checks into Confidence Cards: machine-readable, verifiable certificates with anytime-valid statistical bounds. We outline a research plan to bring SIEVE to maturity, replacing narrative cards with anytime-verifiable certification. We expect this shift to lower quality-assurance costs and increase trust in code datasets.