This work provides an overview of StyloMetrix, an open-source multilingual tool. It offers stylometric text representations that cover various aspects of grammar, syntax and lexicon. StyloMetrix supports four languages: Polish as the primary language, English, Ukrainian and Russian. The normalized output of each feature can serve as a useful source of input for machine learning models and as a valuable addition to the embedding layer of a deep learning algorithm. We strive to provide a concise but exhaustive overview of the application of StyloMetrix vectors and to explain the sets of linguistic features we developed. The experiments show promising results in supervised content classification with simple algorithms such as Random Forest Classifier, Voting Classifier, Logistic Regression and others. The deep learning assessments show that StyloMetrix vectors are useful for enhancing an embedding layer extracted from Transformer architectures. Overall, StyloMetrix has proven to be a strong source of features for machine learning and deep learning algorithms across different classification tasks.
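The two uses described above, feeding normalized stylometric vectors to simple classifiers and appending them to Transformer embeddings, can be sketched roughly as follows. This is only an illustration under stated assumptions: the arrays are synthetic placeholders rather than real StyloMetrix output or real Transformer embeddings, the helper `combine` and all dimensions are hypothetical, and the StyloMetrix API itself is not used. It assumes NumPy and scikit-learn are installed.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical sizes: 200 documents, 20 stylometric features, 32-dim embeddings.
n_docs, n_style, n_embed = 200, 20, 32
labels = rng.integers(0, 2, size=n_docs)

# Synthetic stand-in for normalized stylometric vectors (values in [0, 1)).
# A class-dependent offset is added so the toy data carries some signal.
style = 0.5 * rng.random((n_docs, n_style)) \
    + 0.5 * labels[:, None] * rng.random((n_docs, n_style))

# Synthetic stand-in for document embeddings taken from a Transformer.
embed = rng.normal(size=(n_docs, n_embed))


def combine(style_vectors, embedding_vectors):
    """Concatenate stylometric and embedding features per document."""
    return np.hstack([style_vectors, embedding_vectors])


features = combine(style, embed)

# Soft-voting ensemble over two of the simple models named in the text.
clf = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
)

# Train on the first 150 documents, evaluate on the remaining 50.
clf.fit(features[:150], labels[:150])
accuracy = clf.score(features[150:], labels[150:])
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

Plain concatenation is chosen here because it is the simplest way to add a fixed-length feature vector to an embedding; the score it prints describes only the synthetic data and says nothing about the results reported in this work.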