We introduce J-UniMorph, a Japanese morphology dataset built on the UniMorph feature schema. The dataset covers the rich verb forms that arise from the agglutinative nature of Japanese. J-UniMorph differs from the existing Japanese subset of UniMorph, which is automatically extracted from Wiktionary. The Wiktionary Edition provides around 12 inflected forms per word on average and is dominated by denominal verbs (i.e., [noun] + suru (do-PRS)), which are morphologically equivalent to the verb suru (do). In contrast, J-UniMorph covers a much broader and more frequently used range of verb forms, offering 118 inflected forms per word on average. It includes honorifics, a range of politeness levels, and other linguistic nuances that are distinctive of Japanese. This paper presents detailed statistics and characteristics of J-UniMorph and compares it with the Wiktionary Edition. We publicly release J-UniMorph and its interactive visualizer to support cross-linguistic research and various applications.