Recent advances in Natural Language Processing (NLP) have underscored the crucial role of high-quality datasets in building large language models (LLMs). However, while extensive resources and analyses exist for English, the landscape for East Asian languages, particularly Chinese, Japanese, and Korean (CJK), remains fragmented and underexplored, even though these languages together serve over 1.6 billion speakers. To address this gap, we investigate the HuggingFace ecosystem from a cross-linguistic perspective, focusing on how cultural norms, research environments, and institutional practices shape dataset availability and quality. Drawing on more than 3,300 datasets, we employ quantitative and qualitative methods to examine how these factors drive distinct creation and curation patterns across the Chinese, Japanese, and Korean NLP communities. Our findings highlight the large-scale and often institution-driven nature of Chinese datasets, grassroots, community-led development in Korean NLP, and an emphasis on entertainment and subculture in Japanese collections. From these patterns, we derive practical strategies for improving dataset documentation, licensing clarity, and cross-lingual resource sharing, with the goal of guiding more effective and culturally attuned LLM development in East Asia. We conclude by discussing best practices for future dataset curation and collaboration, aiming to strengthen resource development across all three languages.