In the current Large Language Model (LLM) ecosystem, creators have little agency over how their data is used, and LLM users may find themselves unknowingly plagiarizing existing sources. Attribution of LLM-generated text to LLM input data could help with these challenges, but so far we have more questions than answers: which elements of LLM outputs require attribution, what goals should attribution serve, and how should it be implemented? We contribute a human-centric data attribution framework, which situates the attribution problem within the broader data economy. Individual use cases for attribution, such as creative writing assistance or fact-checking, can be specified via a set of parameters (including stakeholder objectives and implementation criteria). These criteria are open to negotiation among the relevant stakeholder groups: creators, LLM users, and their intermediaries (publishers, platforms, AI companies). The outcome of domain-specific negotiations can then be implemented and tested against the stakeholder goals. The proposed approach provides a bridge between methodological NLP work on data attribution, governance work on policy interventions, and economic analysis of creator incentives for a sustainable equilibrium in the data economy.