Data science has become increasingly essential for the production of official statistics, as it enables the automated collection, processing, and analysis of large amounts of data. Such data science practices allow for more timely, more insightful, and more flexible reporting. However, the quality and integrity of data-science-driven statistics rely on the accuracy and reliability of the data sources and the machine learning techniques that support them. In particular, changes in data sources are inevitable and pose significant risks that must be addressed in the context of machine learning for official statistics. This paper gives an overview of the main risks, liabilities, and uncertainties associated with changing data sources in that context. We provide a checklist of the most prevalent origins and causes of changing data sources, not only on a technical level but also regarding ownership, ethics, regulation, and public perception. Next, we highlight the repercussions of changing data sources on statistical reporting. These include technical effects such as concept drift and bias, effects on availability, validity, accuracy, and completeness, and also risks to the neutrality of the statistical offering and its potential discontinuation. We offer a few important precautionary measures, such as enhancing robustness in both data sourcing and statistical techniques, and thorough monitoring. With these measures in place, machine learning-based official statistics can maintain their integrity, reliability, consistency, and relevance in policy-making, decision-making, and public discourse.