While quantitative methods have been used to examine changes in word usage in books, studies have focused on overall trends, such as the shapes of narratives, which are independent of book length. We instead look at how words change over the course of a book as a function of the number of words, rather than the fraction of the book, completed at any given point; we define this measure as "cumulative word-time". Using ousiometrics, a reinterpretation of the valence-arousal-dominance framework of meaning obtained from semantic differentials, we convert text into time series of power and danger scores in cumulative word-time. Each time series is then decomposed using empirical mode decomposition into a sum of constituent oscillatory modes and a non-oscillatory trend. By comparing the decomposition of the original power and danger time series with those derived from shuffled text, we find that shorter books exhibit only a general trend, while longer books have fluctuations in addition to the general trend. These fluctuations typically have a period of a few thousand words regardless of the book length or library classification code, but vary depending on the content and structure of the book. Our findings suggest that, in the ousiometric sense, longer books are not expanded versions of shorter books, but are more similar in structure to a concatenation of shorter texts. Further, they are consistent with editorial practices that require longer texts to be broken down into sections, such as chapters. Our method also provides a data-driven denoising approach that works for texts of various lengths, in contrast to the more traditional approach of using large window sizes that may inadvertently smooth out relevant information, especially for shorter texts. These results open up avenues for future work in computational literary analysis, particularly the measurement of a basic unit of narrative.
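To make the idea of cumulative word-time and the shuffled-text comparison concrete, the following is a minimal sketch under stated assumptions, not the pipeline used in this work. The lexicon `TOY_LEXICON` and its scores are invented for illustration. The helpers `score_series`, `block_means`, and `structure_ratio` are hypothetical names. Counting only scored words toward word-time is also an assumption. In place of empirical mode decomposition, the sketch substitutes a much cruder statistic: the variance of block-averaged scores in the original word order, divided by the same quantity averaged over shuffled orderings.

```python
import random
from statistics import mean, pvariance

# Hypothetical toy lexicon: word -> (power, danger). These scores are invented
# for illustration and are NOT the ousiometric lexicon used in the paper.
TOY_LEXICON = {
    "king": (0.8, 0.1),
    "storm": (0.3, 0.7),
    "garden": (0.1, -0.6),
    "war": (0.5, 0.9),
    "child": (-0.4, -0.5),
}


def score_series(words, lexicon, dim):
    """Return one score per scored word, indexed by cumulative word-time.

    Index t counts scored words read so far (an assumption of this sketch),
    not the fraction of the book completed. dim is 0 for power, 1 for danger.
    """
    return [lexicon[w][dim] for w in words if w in lexicon]


def block_means(series, block):
    """Average the series over consecutive, non-overlapping blocks of words."""
    return [
        mean(series[i:i + block])
        for i in range(0, len(series) - block + 1, block)
    ]


def structure_ratio(series, block, n_shuffles=200, seed=0):
    """Compare block-mean variance in the original order with shuffled text.

    This is a crude stand-in for empirical mode decomposition. A ratio well
    above 1 suggests fluctuations that depend on word order rather than on
    word choice alone.
    """
    rng = random.Random(seed)
    observed = pvariance(block_means(series, block))
    baseline = []
    for _ in range(n_shuffles):
        shuffled = series[:]
        rng.shuffle(shuffled)
        baseline.append(pvariance(block_means(shuffled, block)))
    return observed / mean(baseline)


# A synthetic "book" that alternates between dangerous and safe passages.
words = (["war", "storm"] * 25 + ["garden", "child"] * 25) * 4
danger = score_series(words, TOY_LEXICON, dim=1)
print(len(danger), structure_ratio(danger, block=50))
```

Because the series is indexed by word count, a fixed block size of 50 words means the same thing for a short text and a long one, which would not hold if position were measured as a fraction of the book. The shuffled baseline keeps the word frequencies and destroys only the ordering, so any excess variance in the original order is attributable to structure.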