In the current Large Language Model (LLM) ecosystem, creators have little agency over how their data is used, and LLM users may find themselves unknowingly plagiarizing existing sources. Attributing LLM-generated text to LLM input data could help with these challenges, but so far we have more questions than answers: which elements of LLM outputs require attribution, what goals should attribution serve, and how should it be implemented? We contribute a human-centric data attribution framework, which situates the attribution problem within the broader data economy. Individual use cases for attribution, such as creative writing assistance or fact-checking, can be specified via a set of parameters (including stakeholder objectives and implementation criteria). These criteria are up for negotiation by the relevant stakeholder groups: creators, LLM users, and their intermediaries (publishers, platforms, AI companies). The outcome of domain-specific negotiations can then be implemented and tested to determine whether the stakeholder goals are achieved. The proposed approach provides a bridge between methodological NLP work on data attribution, governance work on policy interventions, and economic analysis of creator incentives for a sustainable equilibrium in the data economy.