The Dynamics of Human and AI-Generated Language: How Semantics Fluctuates across Different Timescales

Spoken language, whether produced by humans or large language models (LLMs), unfolds over time with varying semantic content. However, we still lack simple, interpretable time-series features that capture how generic versus specific content is distributed over time and that can be used to compare human and AI-generated speech. We introduce a semantic-timescale analysis pipeline that turns word-level transcripts with timestamps into semantic time series. For each spoken narrative, we compute (i) semantic specificity using WordNet-based word depth and (ii) contextual similarity using SBERT embeddings, and we quantify their temporal dependence using autocorrelation-window measures (ACW-0 and related metrics). We then compare original speech to multiple shuffled controls that selectively disrupt lexical identity, temporal order, and word duration. Across human-read autobiographical narratives, TTS readings, and LLM-generated texts rendered with TTS, we find that segments with longer ACW-0 in the semantic time series tend to contain more generic vocabulary, whereas segments with shorter ACW-0 are enriched in more specific words. These associations are strongly attenuated or abolished when word order and timing are randomized, indicating that ACW-based measures capture non-trivial temporal organization of semantic content beyond static lexical distributions. Our results suggest that ACW-based semantic timescales are a useful family of features for analyzing and comparing the temporal structure of human and AI-generated speech.
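The abstract does not spell out how ACW-0 is computed. As a minimal sketch, assuming the common convention that ACW-0 is the first lag at which the autocorrelation function of a series reaches zero or below, the measure could be computed for an evenly sampled semantic time series as follows. The function names `autocorrelation` and `acw0` and the sampling step `dt` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def autocorrelation(x):
    """Normalized autocorrelation of a 1-D series for lags 0..n-1 (biased estimator)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                      # remove the mean before correlating
    n = x.size
    denom = np.dot(x, x)                  # lag-0 term, used for normalization
    if denom == 0:
        raise ValueError("constant series has undefined autocorrelation")
    return np.array([np.dot(x[:n - k], x[k:]) / denom for k in range(n)])

def acw0(x, dt=1.0):
    """Assumed ACW-0: first lag (in units of dt) where the autocorrelation is <= 0.

    Returns NaN if the autocorrelation never reaches zero within the series.
    """
    acf = autocorrelation(x)
    nonpositive = np.nonzero(acf <= 0)[0]
    if nonpositive.size == 0:
        return np.nan
    return nonpositive[0] * dt

# Toy stand-ins for a semantic time series (e.g., per-word WordNet depth):
t = np.arange(400)
slow = np.sin(2 * np.pi * t / 40)         # slowly varying content
fast = np.sin(2 * np.pi * t / 8)          # rapidly varying content
print(acw0(slow), acw0(fast))             # the slow series yields the longer ACW-0
```

Under this reading, a longer ACW-0 means the semantic signal stays correlated with its own past over more samples, which is the property the abstract relates to more generic vocabulary.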

