This work introduces a benchmark for assessing how well German text embeddings cluster across different domains. The benchmark is motivated by the growing use of clustered neural text embeddings in tasks that require grouping texts (such as topic modeling) and by the lack of German resources in existing benchmarks. We provide an initial analysis of a range of pre-trained monolingual and multilingual models, evaluated on the output of different clustering algorithms. Both monolingual and multilingual models are among the strong performers, and reducing the dimensionality of the embeddings can further improve clustering. Additionally, we experiment with continued pre-training of German BERT models to estimate the benefit of this additional training. Our experiments suggest that significant performance improvements are possible for short texts. All code and datasets are publicly available.