In this study, we use a classification approach to show that different categories of literary "quality" display distinct linguistic profiles. Our corpus comprises titles from the Norton Anthology, the Penguin Classics series, and the Open Syllabus project, contrasted with contemporary bestsellers, Nobel Prize winners, and recipients of prestigious literary awards. Our analysis reveals that canonical and so-called high-brow texts exhibit distinct textual features compared with other quality categories, such as bestsellers and popular titles, as well as with control groups, likely reflecting distinct (but not mutually exclusive) models of quality. We apply a classic machine learning approach, namely Random Forest, to distinguish quality novels from "control groups", achieving F1 scores of up to 77\% in differentiating between the categories. We find that quality categories tend to be easier to distinguish from control groups than from other quality categories, suggesting that features of literary quality may be distinguishable but shared across quality proxies.
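To make the reported metric concrete, the following is a minimal sketch of how an F1 score could be computed for a binary "quality category vs. control group" task. The helper `f1_score` and the label lists are hypothetical illustrations, not the study's data or code, and the sketch does not reproduce the Random Forest model itself.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for one class: the harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels (assumption): 1 = quality category, 0 = control group.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

score = f1_score(y_true, y_pred)
print(f"F1 = {score:.2f}")
```

Because F1 balances precision and recall, a classifier that simply labels every novel as "quality" cannot score well on it unless the quality class dominates the data, which is one reason the metric suits comparisons against control groups.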