We explore how to capture the significance of a sub-text block in an article and how it may be used for text mining tasks. A sub-text block is a sub-sequence of sentences in the article. We formulate the notion of content significance distribution (CSD) of sub-text blocks, referred to as CSD of the first kind and denoted by CSD-1. In particular, we leverage Hugging Face's SentenceTransformer to generate contextual sentence embeddings, and use MoverScore over text embeddings to measure how similar a sub-text block is to the entire text. To overcome the exponential blowup in the number of sub-text blocks, we present an approximation algorithm and show that the approximated CSD-1 is almost identical to the exact CSD-1. Under this approximation, we show that the average and median CSD-1 curves for news, scholarly research, argument, and narrative articles share the same pattern. We also show that, under a certain linear transformation, the complement of the cumulative distribution function of the beta distribution with certain values of $\alpha$ and $\beta$ resembles a CSD-1 curve. We then use CSD-1 curves to extract linguistic features and train an SVC classifier for assessing how well an article is organized. Through experiments, we show that this method achieves high accuracy in assessing student essays. Moreover, we study CSD of sentence locations, referred to as CSD of the second kind and denoted by CSD-2, and show that the average CSD-2 curves for different types of articles possess distinctive patterns, which either conform to common perceptions of article structures or correct them with minor deviations.
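To make the exponential blowup concrete, the following is a minimal sketch, not the method of this paper. It rests on several assumptions that the abstract does not confirm: the curve value at block size $k$ is taken to be the mean similarity to the full text over all sub-text blocks of $k$ sentences; similarity is a toy bag-of-words cosine standing in for MoverScore over SentenceTransformer embeddings; and uniform random sampling stands in for the approximation algorithm. The names `bow`, `cosine`, `exact_curve`, and `sampled_curve` are hypothetical.

```python
import math
import random
from collections import Counter
from itertools import combinations


def bow(sentences):
    """Toy stand-in for an embedding: word counts over a list of sentences."""
    return Counter(w for s in sentences for w in s.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def exact_curve(sentences):
    """Assumed curve: for each k, mean similarity over ALL blocks of k sentences.

    Visits every non-empty sub-sequence, i.e. 2**n - 1 blocks for n sentences.
    """
    full = bow(sentences)
    curve = []
    for k in range(1, len(sentences) + 1):
        # combinations() preserves the original sentence order within a block.
        scores = [cosine(bow(block), full) for block in combinations(sentences, k)]
        curve.append(sum(scores) / len(scores))
    return curve


def sampled_curve(sentences, samples=200, seed=0):
    """Same curve, estimated from a fixed number of random blocks per size k."""
    rng = random.Random(seed)
    full = bow(sentences)
    n = len(sentences)
    curve = []
    for k in range(1, n + 1):
        total = 0.0
        for _ in range(samples):
            idx = sorted(rng.sample(range(n), k))  # keep sentence order
            total += cosine(bow([sentences[i] for i in idx]), full)
        curve.append(total / samples)
    return curve


demo = [
    "The council approved the new transit plan on Monday",
    "The plan adds three bus lines across the city",
    "Funding for the plan comes from a regional grant",
    "Critics say the bus lines skip the east side",
    "The council will review the transit plan next year",
    "Riders welcomed the new bus lines",
]
exact = exact_curve(demo)      # scores 2**6 - 1 = 63 blocks
approx = sampled_curve(demo)   # scores 6 * 200 sampled blocks
```

The exact version scores $2^n - 1$ blocks for $n$ sentences, which is already out of reach for an article of a few dozen sentences, while the sampled version costs a fixed number of evaluations per block size. Both curves end at 1, since the only block containing every sentence is the article itself.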