Formality is an important characteristic of text documents. Automatic detection of the formality level of a text is potentially beneficial for various natural language processing tasks. Previously, two large-scale datasets featuring formality annotation were introduced for multiple languages: GYAFC and X-FORMAL. However, they have primarily been used to train style transfer models, even though formality detection may be a useful application in its own right. This work presents what is, to our knowledge, the first systematic study of formality detection methods, covering statistical, neural, and Transformer-based machine learning approaches, and releases the best-performing models for public use. We conducted three types of experiments: monolingual, multilingual, and cross-lingual. The study shows that the Char BiLSTM model outperforms Transformer-based models on the monolingual and multilingual formality classification tasks, while Transformer-based classifiers are more robust in cross-lingual knowledge transfer.