Korean is often referred to as a low-resource language in the research community. While this claim is partially true, the perception also stems from the fact that available resources are inadequately advertised and curated. This work curates and reviews a list of Korean corpora, first describing institution-level resource development and then walking through current open datasets for different types of tasks. We then propose a direction for how open-source datasets should be constructed and released for less-resourced languages to promote research.