We explore how to capture the significance of a sub-text block in an article and how it may be used for text mining tasks. A sub-text block is a sub-sequence of sentences in the article. We formulate the notion of content significance distribution (CSD) of sub-text blocks, referred to as CSD of the first kind and denoted by CSD-1. In particular, we leverage Hugging Face's SentenceTransformer to generate contextual sentence embeddings, and use MoverScore over text embeddings to measure how similar a sub-text block is to the entire text. To overcome the exponential blowup in the number of sub-text blocks, we present an approximation algorithm and show that the approximated CSD-1 is almost identical to the exact CSD-1. Under this approximation, we show that the average and median CSD-1 curves for news, scholarly research, argument, and narrative articles share the same pattern. We also show that, under a certain linear transformation, the complement of the cumulative distribution function of the beta distribution with certain values of $\alpha$ and $\beta$ resembles a CSD-1 curve. We then extract linguistic features from CSD-1 curves to train an SVC classifier for assessing how well an article is organized. Through experiments, we show that this method achieves high accuracy in assessing student essays. Moreover, we study CSD of sentence locations, referred to as CSD of the second kind and denoted by CSD-2, and show that the average CSD-2 curves for different types of articles possess distinctive patterns, which either conform to common perceptions of article structures or rectify them with minor deviations.
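The beta-distribution comparison can be made concrete with a small numerical sketch. The abstract does not state the values of $\alpha$ and $\beta$ or the form of the linear transformation, so everything specific below is an assumption for illustration: the parameters in the usage lines are hypothetical, the transformation is assumed to be an affine rescaling of the curve's values (`scale` and `shift` are made-up names), and the CDF is computed with a plain Simpson's rule that is only valid for $\alpha \ge 1$ and $\beta \ge 1$.

```python
import math


def beta_pdf(t, a, b):
    """Density of the beta distribution with shape parameters a and b."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * t ** (a - 1) * (1 - t) ** (b - 1)


def beta_cdf(x, a, b, steps=2000):
    """Beta CDF on [0, 1] via composite Simpson's rule.

    Assumption: a >= 1 and b >= 1, so the density stays finite at the
    endpoints. `steps` must be even.
    """
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    h = x / steps
    total = beta_pdf(0.0, a, b) + beta_pdf(x, a, b)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * beta_pdf(i * h, a, b)
    return total * h / 3


def beta_survival(x, a, b):
    """Complement of the beta CDF: 1 - F(x; a, b)."""
    return 1.0 - beta_cdf(x, a, b)


def transformed_curve(x, a, b, scale, shift):
    """Hypothetical affine rescaling of the complement-CDF values.

    The source only says "a certain linear transformation"; this form
    is an assumption, not the authors' definition.
    """
    return scale * beta_survival(x, a, b) + shift


# Hypothetical parameters, chosen only so the result is easy to check by hand:
# for a=2, b=1 the CDF is x**2, so the complement at 0.5 is 0.75.
points = [i / 10 for i in range(11)]
curve = [transformed_curve(x, 2.0, 1.0, scale=-1.0, shift=1.0) for x in points]
```

Fitting such a curve to a measured CSD-1 would amount to searching over the two shape parameters and the transformation coefficients; that step is not shown because the abstract gives no details on how the resemblance was established.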