With the rise of deep learning, large datasets and complex models have become common, requiring significant computing power. To address this, data distillation has emerged as a technique for training models quickly with lower memory and time requirements. However, data distillation for text-based datasets remains underexplored because of the challenges arising from the discrete nature of text. Additionally, existing dataset distillation methods often struggle to generalize to new architectures. In this paper, we propose several data distillation techniques for multilingual text classification datasets using language-model-based learning methods. We conduct experiments to analyze their performance in terms of classification strength and cross-architecture generalization. Furthermore, we investigate the language-specific fairness of the data summaries generated by these methods. Our approach builds upon existing techniques, enhancing cross-architecture generalization in the text data distillation domain.