This article seeks to determine the extent to which the principle of persistence is observed by repositories and the organizations that operate them. We also evaluate the impact that poor repository persistence may be having on the scholarly record. We do this by interrogating and combining data about European repositories from several repository registries and web-scraped sources, including the Internet Archive's Wayback Machine, thereby creating a unique dataset of historic repository locations and their OAI-PMH endpoints. We then use this dataset as the basis for text mining CORE, a vast corpus of scholarly outputs, to determine the extent to which impersistent European repository content has permeated the scholarly literature. Our findings indicate that over a fifth of European repositories (> 20%) could be classified as 'dead', and that an even greater proportion (> 40%) of the machine interfaces associated with these repositories are similarly dead. Problematically, our analysis indicates that circa 12,000 unique scholarly works cite, refer to, or actively use this repository content. Together these works point to circa 19,000 unique repository locations, all of which are now unretrievable from their stated resource location. Partly owing to limitations in the available repository registry data and the existence of 'zombie' repositories, there are reasons to conclude that the total number of scholarly works referring to dead repository content is far higher. We also find evidence of dead repository content entering the current scholarly record, a phenomenon we describe as 'dead on arrival' referencing. We consider the implications of these observations, proffer explanations, and propose possible policy interventions to address the issue of repository persistence. Our dataset also enables us to make several observations about the nature of impersistent repositories, their profile, and their decay rate.
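The two measurement steps described above, checking whether a repository's OAI-PMH interface still responds and mining full texts for references to dead repository locations, might be sketched as follows. This is a minimal illustration and not the procedure used in the study: the criteria for labelling an endpoint 'dead', the function names `classify_endpoint` and `find_dead_references`, the URL pattern, and the example hosts are all assumptions made here for the sake of the example.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
# Rough URL pattern (an assumption); real text mining needs more care.
URL_RE = re.compile(r"https?://[^\s<>\"')\[\]]+")


def classify_endpoint(status, body):
    """Label an endpoint 'alive' or 'dead' from the HTTP status and body
    returned for an OAI-PMH `?verb=Identify` request.

    Assumed criteria: no response, an HTTP error, or a body that is not
    a well-formed OAI-PMH Identify response all count as 'dead'.
    """
    if status is None or status >= 400:
        return "dead"
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return "dead"  # e.g. a parked-domain HTML page
    has_identify = root.find(OAI_NS + "Identify") is not None
    if root.tag == OAI_NS + "OAI-PMH" and has_identify:
        return "alive"
    return "dead"


def find_dead_references(works, dead_hosts):
    """Scan full texts for URLs whose host is a known dead repository.

    `works` maps a work identifier to its full text; `dead_hosts` is a
    set of lower-case host names. Returns the set of citing work ids
    and the set of unique dead locations they reference.
    """
    citing, locations = set(), set()
    for work_id, text in works.items():
        for url in URL_RE.findall(text):
            url = url.rstrip(".,;")  # drop trailing sentence punctuation
            host = (urlsplit(url).hostname or "").lower()
            if host in dead_hosts:
                citing.add(work_id)
                locations.add(url)
    return citing, locations
```

In this sketch, the counts of unique citing works and unique dead locations correspond to the sizes of the two returned sets. Matching on host name alone would miss repositories that moved to a new path on a live host, so it should be read as a lower bound.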