Meaning in human language is relational, context-dependent, and emergent, arising from dynamic systems of signs rather than fixed word-concept mappings. In computational settings, this semiotic and interpretive complexity complicates both the generation and the evaluation of meaning. This article proposes an interdisciplinary framework for studying meaning in language generated by large language models (LLMs) by integrating semiotics and hermeneutics with qualitative research methods. We review prior scholarship on meaning and machines, examining how linguistic signs are transformed into vectorized representations in static and contextualized embedding models, and identify gaps between statistical approximation and human interpretive meaning. We then introduce the Inductive Conceptual Rating (ICR) metric, a qualitative evaluation approach grounded in inductive content analysis and reflexive thematic analysis, designed to assess semantic accuracy and meaning alignment in LLM outputs beyond lexical similarity metrics. We apply ICR in an empirical comparison of LLM-generated and human-generated thematic summaries across five datasets (N = 50 to 800). While LLMs achieve high linguistic similarity, they underperform on semantic accuracy, particularly in capturing contextually grounded meanings. Performance improves with larger datasets but remains variable across models, potentially reflecting differences in the frequency and coherence of recurring concepts and meanings. We conclude by arguing for evaluation frameworks that leverage systematic qualitative interpretation practices when assessing the meaning of LLM-generated outputs against reference texts.