While quantitative methods have been used to examine changes in word usage in books, studies have focused on overall trends, such as the shapes of narratives, which are independent of book length. We instead look at how words change over the course of a book as a function of the number of words, rather than the fraction of the book, completed at any given point; we define this measure as "cumulative word-time". Using ousiometrics, a reinterpretation of the valence-arousal-dominance framework of meaning obtained from semantic differentials, we convert text into time series of power and danger scores in cumulative word-time. Each time series is then decomposed using empirical mode decomposition into a sum of constituent oscillatory modes and a non-oscillatory trend. By comparing the decomposition of the original power and danger time series with those derived from shuffled text, we find that shorter books exhibit only a general trend, while longer books have fluctuations in addition to the general trend. These fluctuations typically have a period of a few thousand words regardless of the book length or library classification code, but vary depending on the content and structure of the book. Our findings suggest that, in the ousiometric sense, longer books are not expanded versions of shorter books, but are more similar in structure to a concatenation of shorter texts. Further, they are consistent with editorial practices that require longer texts to be broken down into sections, such as chapters. Our method also provides a data-driven denoising approach that works for texts of various lengths, in contrast to the more traditional approach of using large window sizes that may inadvertently smooth out relevant information, especially for shorter texts. These results open up avenues for future work in computational literary analysis, particularly the measurement of a basic unit of narrative.
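The first steps described above, indexing scores by cumulative word-time and building a shuffled-text baseline, can be illustrated with a minimal sketch. It rests on several assumptions not stated in the abstract: the lexicon `TOY_LEXICON` and its values are invented for illustration, the helper names `score_series` and `shuffled_series` are hypothetical, and words without a lexicon entry are assumed to still advance word-time. The empirical mode decomposition step is not implemented here.

```python
import random

# Hypothetical toy lexicon mapping words to (power, danger) scores.
# The values are made up for illustration; they are not real ousiometric scores.
TOY_LEXICON = {
    "king": (0.8, 0.1),
    "storm": (0.3, 0.7),
    "sword": (0.5, 0.6),
    "garden": (0.1, -0.5),
    "sleep": (-0.4, -0.6),
}


def score_series(words, lexicon):
    """Return a list of (t, power, danger) tuples, one per scored word.

    t is cumulative word-time: the number of words read so far (1-based).
    Assumption: words missing from the lexicon still advance t.
    """
    series = []
    for t, word in enumerate(words, start=1):
        if word in lexicon:
            power, danger = lexicon[word]
            series.append((t, power, danger))
    return series


def shuffled_series(words, lexicon, seed=0):
    """Score a randomly reordered copy of the text (the shuffled baseline)."""
    rng = random.Random(seed)
    shuffled = list(words)  # copy so the original word order is untouched
    rng.shuffle(shuffled)
    return score_series(shuffled, lexicon)


text = "the king drew his sword as the storm broke over the garden".split()
original = score_series(text, TOY_LEXICON)
baseline = shuffled_series(text, TOY_LEXICON)
```

Shuffling keeps the set of scores but discards their order, so any structure found in the decomposition of `original` and absent from that of `baseline` can be attributed to word order rather than to vocabulary alone. In a fuller pipeline, both series would then be passed to an empirical mode decomposition routine.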