The rapid advancement of general-purpose AI models has intensified concerns about copyright infringement in training data, yet current regulatory frameworks remain predominantly reactive rather than proactive. This paper examines the regulatory landscape of AI training data governance in major jurisdictions, including the EU, the United States, and the Asia-Pacific region, and identifies critical gaps in enforcement mechanisms that threaten both creator rights and the sustainability of AI development. Through analysis of major cases, we identify significant weaknesses in pre-training data filtering. Existing solutions such as transparency tools, perceptual hashing, and access control mechanisms address only specific aspects of the problem and cannot prevent infringement at its source. We identify two fundamental challenges: pre-training license collection and content filtering, where comprehensive copyright clearance is infeasible at scale, and verification, where no tools exist to confirm that filtering actually prevented infringement. We propose a multilayered filtering pipeline that combines access control, content verification, machine learning classifiers, and continuous database cross-referencing, shifting copyright protection from post-training detection to pre-training prevention. This approach offers a pathway toward protecting creator rights while enabling continued AI innovation.
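The multilayered pipeline described above could be sketched as a chain of filter stages, where a sample must pass every stage before entering the training corpus. This is a minimal illustrative sketch, not the paper's implementation: every function, license tag, and lookup set below is a hypothetical stand-in for the real mechanisms (license registries, perceptual-hash databases, trained classifiers, rights-holder databases).

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Sample:
    """A candidate training item with provenance metadata (illustrative)."""
    content: str
    source_url: str = ""
    license_tag: str = "unknown"  # e.g. "cc-by", "proprietary", "unknown"

# Each stage returns (passed, reason). All stage logic here is a placeholder.
Stage = Callable[[Sample], tuple[bool, str]]

def access_control(s: Sample) -> tuple[bool, str]:
    # Reject items whose license was never collected or is non-permissive.
    allowed = {"cc0", "cc-by", "public-domain"}
    return (s.license_tag in allowed, f"license={s.license_tag}")

def content_verification(s: Sample) -> tuple[bool, str]:
    # Placeholder for perceptual hashing against a registry of known works.
    known_hashes = {hash("a registered copyrighted passage")}
    return (hash(s.content) not in known_hashes, "hash check")

def ml_classifier(s: Sample) -> tuple[bool, str]:
    # Stand-in for a learned copyright-risk classifier; here a keyword rule.
    risky = "all rights reserved" in s.content.lower()
    return (not risky, "classifier score")

def database_crossref(s: Sample) -> tuple[bool, str]:
    # Continuous cross-referencing against a rights-holder database.
    registered_sources = {"https://example.com/paywalled-novel"}
    return (s.source_url not in registered_sources, "registry lookup")

PIPELINE: list[Stage] = [access_control, content_verification,
                         ml_classifier, database_crossref]

def filter_sample(s: Sample) -> tuple[bool, str]:
    """Run every stage; the first failure excludes the sample before training."""
    for stage in PIPELINE:
        ok, reason = stage(s)
        if not ok:
            return False, f"{stage.__name__}: {reason}"
    return True, "accepted"
```

The ordering is deliberate: cheap metadata checks (license tags) run before expensive ones (hashing, classification), so clearly unlicensed material is excluded before any content analysis, reflecting the shift from post-training detection to pre-training prevention.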