Paper publications are no longer the only form of research product. Owing to recent initiatives by publication venues and funding institutions, open-access datasets and software products are increasingly considered research products, and URIs to these products are becoming more prevalent in scholarly publications. However, as with all URIs, resources on the live Web are not permanent. Archivists and institutions including Software Heritage, Internet Archive, and Zenodo are working to preserve data and software products as valuable parts of reproducibility, a cornerstone of scientific research. While some hosting platforms are well known and can be identified with regular expressions, researchers also host their data and software on a vast number of smaller, more niche platforms. If it is not feasible to manually identify all hosting platforms used by researchers, how can we identify URIs to open-access data and software (OADS) to aid in their preservation? We used a hybrid classifier to classify URIs as OADS URIs or non-OADS URIs. We found that URIs to Git hosting platforms (GHPs), including GitHub, GitLab, SourceForge, and Bitbucket, accounted for 33% of OADS URIs. Non-GHP OADS URIs are distributed across almost 50,000 unique hostnames. We determined that a hybrid classifier can identify OADS URIs on less common hosting platforms, which can improve discoverability and thereby support the preservation of datasets and software as research products for reproducibility.
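
To illustrate the regular-expression approach mentioned for well-known hosting platforms, the following is a minimal sketch that flags URIs whose hostname belongs to one of the four named GHPs. The hostnames, the pattern, and the helper name `is_ghp_uri` are assumptions made for illustration; the abstract does not give the patterns actually used, and this sketch does not represent the hybrid classifier.

```python
import re
from urllib.parse import urlsplit

# Assumed hostnames for the four GHPs named in the abstract; the real
# patterns used in the study are not given there.
GHP_HOST_RE = re.compile(
    r"(?:^|\.)(github\.com|gitlab\.com|sourceforge\.net|bitbucket\.org)$",
    re.IGNORECASE,
)


def is_ghp_uri(uri: str) -> bool:
    """Return True if the URI's hostname is a known GHP or one of its subdomains."""
    host = urlsplit(uri).hostname  # None when the URI has no network location
    return bool(host and GHP_HOST_RE.search(host))


if __name__ == "__main__":
    for uri in (
        "https://github.com/user/repo",
        "https://sourceforge.net/projects/example",
        "https://data.example.org/dataset/42",
    ):
        print(uri, is_ghp_uri(uri))
```

A hostname rule like this only covers platforms that are listed in advance, so an OADS URI on any other hostname is missed. That gap is what motivates a classifier for the long tail of less common platforms.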