This case study shows that Taxicab Correspondence Analysis can reveal the underlying structure of an extremely sparse binary textual data set as a binary tree, whose nodes represent clusters of words that can be interpreted as topics. The data set comprises the text of Israel's Declaration of Independence and interviews with 40 diverse Israeli interviewees. The analysis thus supports a compare-and-contrast study of textual data from two different sources. Furthermore, we propose an adjusted sparsity index that takes into account the size of the data table.