This article investigates knowledge transfer from the RuQTopics dataset, a Russian topical dataset that combines a large number of samples (361,560 single-label, 170,930 multi-label) with extensive class coverage (76 classes). We prepared the dataset from the "Yandex Que" raw data. Evaluating RuQTopics-trained models on the six matching classes of the Russian MASSIVE subset, we show that RuQTopics is suitable for real-world conversational tasks: Russian-only models trained on it consistently reach an accuracy of around 85\% on this subset. We also find that, for multilingual BERT trained on RuQTopics and evaluated on the same six classes of MASSIVE across all MASSIVE languages, the language-wise accuracy correlates closely (Spearman correlation 0.773, p-value 2.997e-11) with the approximate size of BERT's pretraining data for the corresponding language. At the same time, the correlation of the language-wise accuracy with linguistic distance from Russian is not statistically significant.
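
The language-wise analysis rests on the Spearman rank correlation between two per-language quantities: accuracy and approximate pretraining data size. The following is a minimal sketch of that statistic in plain Python, not the authors' code. The function names, the language keys (`lang_a` to `lang_d`) and all numeric values are hypothetical placeholders chosen for illustration; they are not results from the article. The sketch computes only the coefficient, not the p-value.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of entries tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical placeholder data, NOT taken from the article:
# approximate pretraining data size and accuracy per language.
pretraining_size = {"lang_a": 100, "lang_b": 50, "lang_c": 20, "lang_d": 5}
accuracy = {"lang_a": 0.85, "lang_b": 0.80, "lang_c": 0.82, "lang_d": 0.60}

languages = sorted(pretraining_size)
rho = spearman(
    [pretraining_size[lang] for lang in languages],
    [accuracy[lang] for lang in languages],
)
```

Because the coefficient depends only on ranks, it is insensitive to the scale of the size estimates, which suits a quantity described as approximate. In practice a library routine such as `scipy.stats.spearmanr` would typically be used, since it also reports the p-value.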