Identifying psycholinguistic characteristics of written texts is a task that is attracting growing attention from researchers. One of the most widely used tools in the field is Linguistic Inquiry and Word Count (LIWC), which was originally developed to analyze English texts and has since been translated into multiple languages. We adapt the LIWC methodology to the Russian language, taking into account its grammatical and cultural specifics. The proposed approach comprises 96 categories that integrate syntactic, morphological, lexical, and general statistical features, as well as predictions obtained from pre-trained language models (LMs) for text analysis. Rather than directly translating existing thesauri, we built the dictionary specifically for Russian, drawing on several lexicographic resources, semantic dictionaries, and corpora. The paper describes the process of mapping lemmas to 42 psycholinguistic categories and the implementation of the analyzer as part of the RusLICA web service.
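To make the dictionary-based part of the abstract concrete, the following is a minimal sketch of LIWC-style counting over lemmas. It is not the RusLICA implementation: the `LEMMA_CATEGORIES` entries, the category names, and the `category_shares` helper are hypothetical, and the input is assumed to be already lemmatized by some external tool.

```python
from collections import Counter

# Hypothetical toy dictionary: lemma -> set of psycholinguistic categories.
# Entries and category names are illustrative only, not taken from RusLICA.
LEMMA_CATEGORIES = {
    "радость": {"positive_emotion"},
    "страх": {"negative_emotion"},
    "думать": {"cognitive_process"},
}


def category_shares(lemmas):
    """Return, for each category, the share of lemmas in the text mapped to it."""
    counts = Counter()
    for lemma in lemmas:
        # Lemmas missing from the dictionary contribute to no category.
        for category in LEMMA_CATEGORIES.get(lemma, ()):
            counts[category] += 1
    total = len(lemmas)
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


# Assumed input: a text that has already been tokenized and lemmatized.
lemmas = ["я", "думать", "о", "радость", "и", "страх"]
shares = category_shares(lemmas)
```

Normalizing by the total number of lemmas keeps scores comparable across texts of different lengths, which is the usual convention for LIWC-style outputs. Whether RusLICA normalizes in the same way is not stated in the abstract.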