This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs). We start from the key observation that data is instrumental in both the developmental stages (e.g., pretraining and fine-tuning) and the inferential stages (e.g., in-context learning) of LLMs, yet it receives disproportionately little attention from the research community. We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization. In each scenario, we underscore the importance of data, highlight promising research directions, and articulate the potential impacts on the research community and, where applicable, on society as a whole. For instance, we advocate for a suite of data-centric benchmarks tailored to the scale and complexity of data for LLMs. These benchmarks can be used to develop new data curation methods and to document research efforts and results, which can help promote openness and transparency in AI and LLM research.