I apply Schmidhuber's compression progress theory of interestingness at corpus scale, analyzing semantic novelty trajectories in more than 80,000 books spanning two centuries of English-language publishing. Using sentence-transformer paragraph embeddings and a running-centroid novelty measure, I compare 28,730 pre-1920 Project Gutenberg books (PG19) against 52,796 modern English books (Books3, approximately 1990-2010). There are four principal findings. First, mean paragraph-level novelty is roughly 10% higher in modern books (0.503 vs. 0.459). Second, trajectory circuitousness (the ratio of cumulative path length to net displacement in embedding space) is 67% higher in the modern corpus. Third, convergent narrative curves, in which novelty declines toward a settled semantic register, are 2.3 times more common in pre-1920 literature. Fourth, novelty is essentially uncorrelated with reader quality ratings (r = -0.002), suggesting that interestingness in Schmidhuber's sense is structurally independent of perceived literary merit. Clustering paragraph-level trajectories via PAA-16 representations reveals eight distinct narrative-shape archetypes whose distribution shifts substantially between eras. All analysis code and an interactive exploration toolkit are publicly available at https://bigfivekiller.online/novelty_hub.
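To make the three quantities named in the abstract concrete, the following is a minimal sketch rather than the released analysis code. Circuitousness follows the definition given above (cumulative path length divided by net displacement). The other two functions rest on assumptions: `running_centroid_novelty` takes novelty to be the cosine distance between a paragraph embedding and the mean of all earlier paragraph embeddings, and `paa` reads "PAA-16" as a 16-segment piecewise aggregate approximation. All function names are hypothetical, and the input is assumed to be an array of paragraph embeddings with one row per paragraph.

```python
import numpy as np


def running_centroid_novelty(emb):
    """Cosine distance of each paragraph from the centroid of all earlier ones.

    ASSUMPTION: the abstract does not define the measure; this is one
    plausible reading. Returns n - 1 values for n paragraphs.
    """
    emb = np.asarray(emb, dtype=float)
    if emb.ndim != 2 or len(emb) < 2:
        raise ValueError("need a 2-D array with at least two paragraphs")
    # Row t of `centroids` is the mean of paragraphs 0..t, so it serves
    # as the running centroid seen just before paragraph t + 1.
    counts = np.arange(1, len(emb))[:, None]
    centroids = np.cumsum(emb, axis=0)[:-1] / counts
    current = emb[1:]
    dots = np.sum(current * centroids, axis=1)
    norms = np.linalg.norm(current, axis=1) * np.linalg.norm(centroids, axis=1)
    cosine = dots / np.maximum(norms, 1e-12)  # guard against zero vectors
    return 1.0 - cosine


def circuitousness(emb):
    """Cumulative path length divided by net displacement in embedding space."""
    emb = np.asarray(emb, dtype=float)
    path = np.linalg.norm(np.diff(emb, axis=0), axis=1).sum()
    net = np.linalg.norm(emb[-1] - emb[0])
    return path / net if net > 0 else float("inf")


def paa(series, segments=16):
    """Piecewise aggregate approximation: the mean of each of `segments` chunks.

    ASSUMPTION: "PAA-16" is read as a 16-segment approximation; chunks are
    near-equal in length when the series length is not a multiple of `segments`.
    """
    series = np.asarray(series, dtype=float)
    if len(series) < segments:
        raise ValueError("series is shorter than the number of segments")
    return np.array([chunk.mean() for chunk in np.array_split(series, segments)])
```

Under these assumptions, a book's embedding matrix would be passed to `running_centroid_novelty` to obtain its novelty trajectory, and `paa` would reduce that trajectory to a fixed-length vector so that books of different lengths can be clustered together. A perfectly straight trajectory has circuitousness 1; any detour raises it.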