Artistic works can be studied from several perspectives, one being their reception among readers over time. In this work, we approach the topic through literary works, specifically the task of predicting whether a book will become a best seller. Unlike previous approaches, we use the full content of books and consider both visualization and classification tasks. We used visualization, based on SemAxis and linear discriminant analysis, for a preliminary exploration of the structure and properties of the data. Then, to obtain quantitative and more objective results, we employed several classifiers. These methods were applied to a dataset containing (i) books published from 1895 to 1924 that appeared on the Publishers Weekly Bestseller Lists and (ii) literary works published in the same period that were not mentioned in those lists. Our comparison of methods showed that the best result, obtained by combining a bag-of-words representation with a logistic regression classifier, was an average accuracy of 0.75 for both leave-one-out and 10-fold cross-validation. This outcome suggests that it is infeasible to predict the success of books with high accuracy using only the full content of the texts. Nevertheless, our findings provide insights into the factors leading to the relative success of a literary work.
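To make the evaluated pipeline concrete, the following is a minimal sketch of a bag-of-words representation fed to a logistic regression classifier and scored with leave-one-out cross-validation. It is not the authors' implementation: the function names, the hyperparameters (`lr`, `epochs`, `l2`), the whitespace tokenizer, and the six toy "books" are all assumptions made for illustration, and the optimizer is plain batch gradient descent written with the standard library only.

```python
import math
from collections import Counter


def bag_of_words(texts):
    """Map each text to a vector of raw word counts over a shared vocabulary."""
    counts = [Counter(t.lower().split()) for t in texts]
    vocab = sorted(set().union(*counts))
    return [[c[w] for w in vocab] for c in counts], vocab


def train_logreg(X, y, lr=0.1, epochs=200, l2=0.01):
    """Fit L2-regularized logistic regression by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted prob. minus label
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(model, x):
    """Return 1 (best seller) when the linear score is positive, else 0."""
    w, b = model
    return 1 if b + sum(wj * xj for wj, xj in zip(w, x)) > 0 else 0


def leave_one_out_accuracy(X, y):
    """Train on all samples but one, test on the held-out one, and average."""
    hits = 0
    for i in range(len(X)):
        model = train_logreg(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        hits += predict(model, X[i]) == y[i]
    return hits / len(X)


# Hypothetical toy corpus; 1 = on the bestseller list, 0 = not on it.
texts = [
    "love romance heart adventure",
    "romance love journey heart",
    "adventure love heart romance",
    "treatise on logic and method",
    "method and logic of inquiry",
    "inquiry on method and treatise",
]
labels = [1, 1, 1, 0, 0, 0]

# The vocabulary is built from every text, including the held-out one;
# only the labels are withheld during training.
X, vocab = bag_of_words(texts)
accuracy = leave_one_out_accuracy(X, labels)
print(f"leave-one-out accuracy: {accuracy:.2f}")
```

With full-length books the count vectors become very large and sparse, so a practical experiment would more likely rely on an established library for vectorization, regularization, and cross-validation than on a hand-written loop like this one.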