Spoken language, whether produced by humans or by large language models (LLMs), unfolds over time, and its semantic content varies as it does. However, we still lack simple, interpretable time-series features that capture how generic versus specific content is distributed over time and that can be used to compare human and AI-generated speech. We introduce a semantic-timescale analysis pipeline that converts timestamped word-level transcripts into semantic time series. For each spoken narrative, we compute (i) semantic specificity from WordNet-based word depth and (ii) contextual similarity from SBERT embeddings, and we quantify the temporal dependence of both series using autocorrelation window measures (ACW-0 and related metrics). We then compare the original speech to multiple shuffled controls that selectively disrupt lexical identity, temporal order, and word duration. Across human-read autobiographical narratives, text-to-speech (TTS) readings, and LLM-generated texts rendered with TTS, we find that segments with longer ACW-0 in the semantic time series tend to contain more generic vocabulary, whereas segments with shorter ACW-0 are enriched in more specific words. These associations are strongly attenuated or abolished when word order and timing are randomized, indicating that ACW-based measures capture non-trivial temporal organization of semantic content beyond static lexical distributions. Our results suggest that ACW-based semantic timescales are a useful family of features for analyzing and comparing the temporal structure of human and AI-generated speech.
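To make the ACW-0 measure concrete, the following is a minimal sketch, not the pipeline's actual implementation. It assumes the common definition of ACW-0 as the first lag at which the autocorrelation function of a series reaches or drops below zero, and it assumes the semantic time series has already been resampled onto a uniform time grid with step `dt`. The function names `autocorrelation` and `acw0` and the parameter `dt` are illustrative and do not come from the abstract.

```python
import numpy as np


def autocorrelation(x):
    """Normalized autocorrelation of a 1-D series for lags 0..n-1.

    Uses the biased estimator on the mean-centered series, so acf[0] == 1.
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = x.size
    # 'full' mode returns lags -(n-1)..(n-1); keep the non-negative lags.
    acf = np.correlate(x, x, mode="full")[n - 1:]
    if acf[0] == 0.0:
        raise ValueError("series is constant; autocorrelation is undefined")
    return acf / acf[0]


def acw0(x, dt=1.0):
    """ACW-0 under the assumed definition: the first lag at which the
    autocorrelation is <= 0, expressed in time units (lag index * dt).

    Assumes x is uniformly sampled with step dt. Returns NaN if the
    autocorrelation never reaches zero.
    """
    acf = autocorrelation(x)
    crossings = np.flatnonzero(acf <= 0.0)
    if crossings.size == 0:
        return float("nan")
    return float(crossings[0]) * dt


# Synthetic stand-ins for a semantic time series (e.g. word depth over time):
# a slowly varying signal should yield a longer ACW-0 than a fast one.
t = np.arange(400)
fast = np.sin(2 * np.pi * t / 40)  # period of 40 samples
slow = np.sin(2 * np.pi * t / 80)  # period of 80 samples

acw_fast = acw0(fast)
acw_slow = acw0(slow)
```

For a sinusoid, the autocorrelation first reaches zero near a quarter of the period, so `acw_slow` comes out roughly twice `acw_fast`. On real transcripts, the irregular word timestamps would first need to be mapped onto a regular grid (for example by interpolation or binning); that step is not specified in the abstract and is left out here.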