A consistent body of evidence suggests that dream reports differ significantly from other types of textual transcripts with respect to semantic content. Furthermore, the belief that dream reports constitute rather ``unique'' strings of text appears to be widespread in the dream/sleep research community. This might be a notable issue for the growing number of approaches that use natural language processing (NLP) tools to automatically analyse dream reports, as they largely rely on neural models trained on non-dream corpora scraped from the web. In this work, I adopt state-of-the-art (SotA) large language models (LLMs) to study if and how dream reports deviate from other human-generated text, such as Wikipedia articles. Results show that, taken as a whole, DreamBank does not deviate from Wikipedia. Moreover, on average, single dream reports are significantly more predictable than Wikipedia articles. Preliminary evidence suggests that word count, gender, and visual impairment can significantly shape how predictable a dream report appears to the model.
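To make the notion of "predictability" concrete, the sketch below computes perplexity, one common way to quantify how predictable a text is to a language model. The abstract does not name the measure used, so treating predictability as perplexity is an assumption here; the function name `perplexity` and the toy log-probabilities are likewise hypothetical and are not results from the study. In practice, the per-token log-probabilities would come from an LLM scoring each report or article.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities.

    Defined as exp of the mean negative log-likelihood; lower values
    mean the text is more predictable to the model.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical per-token probabilities, invented purely for illustration.
dream_logprobs = [math.log(0.5), math.log(0.25), math.log(0.5)]
wiki_logprobs = [math.log(0.1), math.log(0.2), math.log(0.05)]

dream_ppl = perplexity(dream_logprobs)
wiki_ppl = perplexity(wiki_logprobs)

# In this toy case the "dream" text has the lower perplexity,
# i.e. it is the more predictable of the two.
print(dream_ppl < wiki_ppl)
```

Under this assumption, a claim such as "single dream reports are more predictable than Wikipedia articles" would correspond to dream reports having a lower average per-document perplexity.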