Influence of various text embeddings on clustering performance in NLP

With the advent of e-commerce platforms, reviews have become crucial for customers assessing the credibility of a product. However, star ratings do not always match the review text written by the customer. For example, a three-star rating (out of five) may accompany review text that would better suit a five-star rating. A clustering approach can be used to reassign star ratings by grouping the review texts into clusters. In this work, we explore the choice of text embeddings for representing these reviews and the impact that choice has on the performance of various classes of clustering algorithms. We use contextual (BERT) and non-contextual (Word2Vec) text embeddings to represent the text and measure their impact on three classes of clustering algorithms: partitioning-based (KMeans), single-linkage agglomerative hierarchical, and density-based (DBSCAN and HDBSCAN), each under various experimental settings. We use the silhouette score, adjusted Rand index, and cluster purity to evaluate the algorithms and discuss the impact of different embeddings on clustering performance. Our results indicate that (i) the type of embedding chosen drastically affects the performance of the algorithm, (ii) performance varies greatly across different types of clustering algorithms, (iii) neither embedding type is uniformly better than the other, and (iv) DBSCAN outperforms KMeans and single-linkage agglomerative clustering but also labels more data points as outliers. We provide a thorough comparison of the performance of the different algorithms and offer numerous ideas to foster further research in the domain of text clustering.
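Two of the evaluation metrics named above, cluster purity and the adjusted Rand index, can be illustrated with a minimal sketch. This is not the authors' code. The function names `cluster_purity` and `adjusted_rand_index` are hypothetical, the toy labels are invented for illustration, and the handling of outliers is an assumption: points carrying the noise label `-1` (the convention used by scikit-learn's DBSCAN) are excluded from purity and reported separately, while the adjusted Rand index treats them as one extra cluster.

```python
from collections import Counter
from math import comb


def cluster_purity(true_labels, cluster_labels, noise_label=-1):
    """Return (purity over non-noise points, fraction of points marked as noise).

    Assumption: noise points are dropped before computing purity.
    """
    kept = [(t, c) for t, c in zip(true_labels, cluster_labels) if c != noise_label]
    if not kept:
        return 0.0, 1.0
    # Count how often each true label occurs inside each cluster.
    per_cluster = {}
    for t, c in kept:
        per_cluster.setdefault(c, Counter())[t] += 1
    # Each cluster contributes the size of its majority class.
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    noise_fraction = 1 - len(kept) / len(cluster_labels)
    return majority_total / len(kept), noise_fraction


def adjusted_rand_index(true_labels, cluster_labels):
    """Adjusted Rand index computed from the contingency table of the two labelings."""
    n = len(true_labels)
    if n < 2:
        return 1.0
    cells = Counter(zip(true_labels, cluster_labels))
    sum_cells = sum(comb(v, 2) for v in cells.values())
    sum_rows = sum(comb(v, 2) for v in Counter(true_labels).values())
    sum_cols = sum(comb(v, 2) for v in Counter(cluster_labels).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings have a single group
    return (sum_cells - expected) / (max_index - expected)


# Toy example: six reviews with star ratings 1 or 5; one point marked as noise.
stars = [1, 1, 1, 5, 5, 5]
clusters = [0, 0, 1, 1, 1, -1]
purity, noise_fraction = cluster_purity(stars, clusters)
ari = adjusted_rand_index(stars, clusters)
print(purity, noise_fraction, ari)
```

Reporting the noise fraction alongside purity matters for the comparison described above: a density-based method can reach high purity on the points it keeps while discarding a large share of the data as outliers.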
