Analyzing the writing styles of authors and articles is key to supporting literary analyses such as author attribution and genre detection. Over the years, rich feature sets, including stylometric features, bag-of-words, and n-grams, have been widely used for such analysis. However, the effectiveness of these features depends largely on the linguistic properties of a particular language and on dataset-specific characteristics. Consequently, techniques based on these feature sets cannot give the desired results across domains. In this paper, we propose a novel Word2vec graph-based model of a document that captures both the context and the style of the document. Using these Word2vec graph-based features, we perform classification for author attribution and genre detection. Our detailed experimental study on a comprehensive set of literary writings shows the effectiveness of this method over traditional feature-based approaches. Our code and data are publicly available at https://cutt.ly/svLjSgk
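To make the idea of a word-embedding graph concrete, the following is a minimal sketch and not the method of the paper: the abstract does not specify how the graph is constructed or which features are extracted. It assumes, purely for illustration, that nodes are the distinct words of a document, that an edge joins two words whose vectors have cosine similarity above a threshold, and that simple graph statistics serve as features. The names `TOY_VECTORS`, `build_word_graph`, `graph_features`, and the threshold value are all hypothetical, and the hand-written 3-dimensional vectors stand in for embeddings that a trained Word2vec model would supply.

```python
import math
from itertools import combinations

# Hand-written toy vectors standing in for trained Word2vec embeddings
# (hypothetical values, chosen only for illustration).
TOY_VECTORS = {
    "ship": (0.9, 0.1, 0.0),
    "sea": (0.8, 0.2, 0.1),
    "storm": (0.7, 0.3, 0.2),
    "love": (0.1, 0.9, 0.1),
    "heart": (0.0, 0.8, 0.3),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_word_graph(tokens, vectors, threshold=0.8):
    """Assumed construction: nodes are in-vocabulary words, edges link similar words."""
    nodes = sorted({t for t in tokens if t in vectors})
    edges = {
        (a, b)
        for a, b in combinations(nodes, 2)
        if cosine(vectors[a], vectors[b]) >= threshold
    }
    return nodes, edges


def graph_features(nodes, edges):
    """A few simple graph statistics that could be fed to a classifier."""
    n, m = len(nodes), len(edges)
    max_edges = n * (n - 1) / 2
    return {
        "num_nodes": n,
        "num_edges": m,
        "density": m / max_edges if max_edges else 0.0,
        "avg_degree": 2 * m / n if n else 0.0,
    }


tokens = "the ship met a storm at sea and my heart sank".split()
nodes, edges = build_word_graph(tokens, TOY_VECTORS)
features = graph_features(nodes, edges)
```

On this toy input the three nautical words form a fully connected triangle while `heart` remains isolated. In a setting like the one the abstract describes, one feature vector per document would then be passed to a standard classifier for author attribution or genre detection.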