Across many scientific fields, measurements often represent the number of times an event occurs. For example, a document can be represented by word occurrence counts, neural activity by spike counts per time window, or online communication by daily email counts. These measurements yield high-dimensional count data that often approximate a Poisson distribution, frequently with low rates that produce substantial sparsity and complicate downstream analysis. A useful approach is to embed the data into a low-dimensional space that preserves meaningful structure, commonly termed dimensionality reduction. Yet existing dimensionality reduction methods, including both linear (e.g., PCA) and nonlinear approaches (e.g., t-SNE), often assume continuous Euclidean geometry, thereby misaligning with the discrete, sparse nature of low-rate count data. Here, we propose p-SNE (Poisson Stochastic Neighbor Embedding), a nonlinear neighbor embedding method designed around the Poisson structure of count data, using KL divergence between Poisson distributions to measure pairwise dissimilarity and Hellinger distance to optimize the embedding. We test p-SNE on synthetic Poisson data and demonstrate its ability to recover meaningful structure in real-world count datasets, including weekday patterns in email communication, research area clusters in OpenReview papers, and temporal drift and stimulus gradients in neural spike recordings.
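The two quantities the abstract names can be illustrated with a minimal sketch. This is not the authors' implementation. It only uses the standard closed form for the KL divergence between independent Poisson distributions, KL(Poisson(a) || Poisson(b)) = a log(a/b) - a + b summed over dimensions, and the standard Hellinger distance between two discrete probability vectors. The function names and the pseudocount `eps`, which turns raw counts into strictly positive rate estimates, are assumptions made for illustration. How p-SNE estimates rates and builds its neighbor distributions is not specified here.

```python
import numpy as np


def poisson_kl(rates_a, rates_b):
    """KL(Poisson(a) || Poisson(b)), summed over independent dimensions.

    Both inputs must be strictly positive rate vectors of equal length.
    """
    a = np.asarray(rates_a, dtype=float)
    b = np.asarray(rates_b, dtype=float)
    return float(np.sum(a * np.log(a / b) - a + b))


def pairwise_poisson_kl(counts, eps=0.5):
    """Matrix D with D[i, j] = KL between the rate vectors of rows i and j.

    Assumption: each row's rates are estimated as counts + eps, where eps is
    a hypothetical pseudocount that keeps the logarithm finite for zeros.
    """
    rates = np.asarray(counts, dtype=float) + eps
    a = rates[:, None, :]  # shape (n, 1, d)
    b = rates[None, :, :]  # shape (1, n, d)
    return np.sum(a * np.log(a / b) - a + b, axis=-1)


def hellinger(p, q):
    """Hellinger distance between two discrete probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Sparse, low-rate synthetic counts: 5 samples, 20 dimensions.
    counts = rng.poisson(lam=0.3, size=(5, 20))
    D = pairwise_poisson_kl(counts)
    print(D.shape)
    print(hellinger([0.5, 0.5], [0.9, 0.1]))
```

The KL matrix is asymmetric in general (`D[i, j]` need not equal `D[j, i]`), whereas the Hellinger distance is symmetric and bounded between 0 and 1.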