No Language Data Left Behind: A Comparative Study of CJK Language Datasets in the Hugging Face Ecosystem

Recent advances in Natural Language Processing (NLP) have underscored the crucial role of high-quality datasets in building large language models (LLMs). However, while extensive resources and analyses exist for English, the landscape for East Asian languages, particularly Chinese, Japanese, and Korean (CJK), remains fragmented and underexplored, even though these languages together serve over 1.6 billion speakers. To address this gap, we investigate the Hugging Face ecosystem from a cross-linguistic perspective, focusing on how cultural norms, research environments, and institutional practices shape dataset availability and quality. Drawing on more than 3,300 datasets, we employ quantitative and qualitative methods to examine how these factors drive distinct creation and curation patterns across the Chinese, Japanese, and Korean NLP communities. Our findings highlight the large-scale and often institution-driven nature of Chinese datasets, grassroots, community-led development in Korean NLP, and an emphasis on entertainment and subculture in Japanese collections. From these patterns we derive practical strategies for improving dataset documentation, licensing clarity, and cross-lingual resource sharing, with the aim of guiding more effective and culturally attuned LLM development in East Asia. We conclude by discussing best practices for future dataset curation and collaboration, aiming to strengthen resource development across all three languages.

