While utilizing syntactic tools such as parts-of-speech (POS) tagging has helped us understand sentence structures and their distribution across diverse corpora, it is quite complex and poses a challenge in natural language processing (NLP). This study focuses on understanding sentence structure balance - usages of nouns, verbs, determiners, etc - harmoniously without relying on such tools. It proposes a novel statistical method that uses American Standard Code for Information Interchange (ASCII) codes to represent text of 11 text corpora from various sources and their lexical category alignment after using their compressed versions through PCA, and analyzes the results through histograms and normality tests such as Shapiro-Wilk and Anderson-Darling Tests. By focusing on ASCII codes, this approach simplifies text processing, although not replacing any syntactic tools but complementing them by offering it as a resource-efficient tool for assessing text balance. The story generated by Grok shows near normality indicating balanced sentence structures in LLM outputs, whereas 4 out of the remaining 10 pass the normality tests. Further research could explore potential applications in text quality evaluation and style analysis with syntactic integration for more broader tasks.
翻译:尽管利用词性标注等句法工具有助于我们理解句子结构及其在不同语料库中的分布,但这类方法较为复杂,对自然语言处理构成了挑战。本研究旨在不依赖此类工具的情况下,理解句子结构的平衡性——即名词、动词、限定词等的和谐使用。本文提出了一种新颖的统计方法:首先使用美国信息交换标准码表示来自11个不同来源的文本语料库内容,通过主成分分析对其压缩版本进行词汇类别对齐,并借助直方图及夏皮罗-威尔克检验、安德森-达林检验等正态性检验分析结果。该方法通过聚焦ASCII编码简化了文本处理流程,虽不能替代现有句法工具,但可作为评估文本平衡性的资源高效型工具与之形成互补。Grok生成的故事数据呈现接近正态分布,表明大语言模型输出具有平衡的句子结构,而其余10个语料库中有4个通过了正态性检验。未来研究可探索该方法在文本质量评估与风格分析中的潜在应用,并结合句法整合以处理更广泛的任务。