We analyze a large corpus of police incident narratives to understand the spatial distribution of the topics they describe. The motivation is that the narrative in each incident report contains fine-grained information that is richer than the category manually assigned by the police. Our approach is to split the corpus into topics using two unsupervised machine learning algorithms: Latent Dirichlet Allocation and Non-negative Matrix Factorization. We validate each learned topic model using model coherence. Then, using a k-nearest neighbors density ratio estimation (kNN-DRE) approach that we propose, we estimate the spatial density ratio per topic and use it for data discovery and analysis of each topic, allowing for insights into the described incidents at scale. We provide a qualitative assessment of each topic and highlight some key benefits of using our kNN-DRE model for estimating spatial trends.
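To make the idea of a per-topic spatial density ratio concrete, the following is a minimal sketch of a generic k-nearest-neighbor density ratio estimate. It is not necessarily the kNN-DRE formulation proposed in the paper. It assumes the textbook kNN density estimate p(x) ≈ k / (n · c_d · r_k(x)^d), so that the ratio of the topic density to the background density reduces to (n_all / n_topic) · (r_all / r_topic)^d. The function names, the choice of k, and the synthetic coordinates are all hypothetical and used only for illustration.

```python
import math
import random


def kth_neighbor_distance(query, points, k):
    """Return the distance from `query` to its k-th nearest point in `points`."""
    distances = sorted(math.dist(query, p) for p in points)
    return distances[k - 1]


def knn_density_ratio(query, topic_points, all_points, k=10):
    """Estimate p_topic(query) / p_all(query) from k-NN radii.

    Assumes the standard kNN density estimate p(x) ~ k / (n * c_d * r_k(x)^d);
    k and the unit-ball volume c_d cancel when the two estimates are divided.
    """
    dim = len(query)
    # Guard against a zero radius when the query coincides with k topic points.
    r_topic = max(kth_neighbor_distance(query, topic_points, k), 1e-12)
    r_all = kth_neighbor_distance(query, all_points, k)
    return (len(all_points) / len(topic_points)) * (r_all / r_topic) ** dim


# Synthetic data (an assumption, not data from the paper): background incidents
# are uniform on the unit square, and one topic is concentrated near (0.25, 0.25).
rng = random.Random(0)
all_points = [(rng.random(), rng.random()) for _ in range(2000)]
topic_points = [(rng.gauss(0.25, 0.05), rng.gauss(0.25, 0.05)) for _ in range(300)]

hot = knn_density_ratio((0.25, 0.25), topic_points, all_points)
cold = knn_density_ratio((0.75, 0.75), topic_points, all_points)
print(f"ratio near the topic hotspot: {hot:.2f}")
print(f"ratio far from the hotspot:   {cold:.4f}")
```

A ratio above 1 marks a location where the topic is over-represented relative to incidents overall, and a ratio below 1 marks one where it is under-represented. The brute-force neighbor search here is quadratic overall and is only meant for small examples; a spatial index would be the usual choice at scale.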