Translators as Invisible Teachers of AI: Copyright, Translation Memory, and the Political Economy of Linguistic Data

This paper examines how the labour of translators has been transformed into foundational data capital for the age of artificial intelligence (AI). Translation memories (TMs) and parallel corpora preserve a one-to-one correspondence between source and target text and therefore constitute extraordinarily valuable supervised training data for machine translation. The development of statistical machine translation (SMT), neural machine translation (NMT), the Transformer architecture, and multilingual large language models (LLMs) cannot be disentangled from the accumulation of such translation data. Yet translators' renditions have been bought as deliverables under contract, segmented as technical objects, and processed as "information analysis" data under copyright law, and in the process they have lost their moral, creative, and economic attribution to the translators who produced them. The paper develops two concepts to capture this process. The first is appropriation without consumption: a mode of use in which works are not read, viewed, or listened to but only mined for statistical features, a use that is legitimated under Article 30-4 of the Japanese Copyright Act. The second is the invisible teacherisation of translators: the process by which translators, through the construction of translation memories, post-editing, and quality assessment, have functioned as teachers of AI without being recognised as such. The analysis draws on four elements: the data supply chain that runs from translators through language service providers (LSPs) and platforms to model developers; a comparative reading of Japanese, European, and United States legal frameworks; the distinction between open and proprietary AI models; and the premium status that human-generated data has acquired in the era of model collapse. On this basis, the paper asks what translators are actually afraid of and points toward concrete directions for redistributive design.
