This study evaluates the performance of TF-IDF weighting, averaged Word2Vec embeddings, and BERT embeddings for document similarity scoring across two contrasting textual domains. Analysis of cosine similarity scores highlights each method's strengths and limitations. The findings underscore TF-IDF's reliance on lexical overlap and Word2Vec's superior semantic generalisation, particularly in cross-domain comparisons. BERT demonstrates lower performance in challenging domains, likely due to insufficient domain-specific fine-tuning.
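The TF-IDF behaviour described above can be sketched with scikit-learn. This is a minimal illustration, not the study's pipeline, and the three example documents are invented for demonstration: two finance sentences sharing vocabulary and one medical sentence with no lexical overlap. TF-IDF assigns the in-domain pair a higher cosine similarity purely because they share a term, illustrating its reliance on lexical overlap.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative documents (not the study's data): two finance
# sentences sharing the term "earnings", one medical sentence.
docs = [
    "The stock market rallied after the earnings report.",
    "Shares climbed following strong quarterly earnings.",
    "The patient was treated with a new antibiotic.",
]

# Build the sparse document-term TF-IDF matrix; English stop words
# are removed so only content words contribute to similarity.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Pairwise cosine similarities between all documents.
sims = cosine_similarity(tfidf)

# The in-domain pair (docs 0 and 1) shares "earnings" and scores
# above zero; the cross-domain pair (docs 0 and 2) shares no
# content words, so TF-IDF gives it a similarity of zero.
print(sims[0, 1], sims[0, 2])
```

A Word2Vec or BERT approach would instead embed each document in a dense semantic space, allowing non-zero similarity between the cross-domain pair even without shared vocabulary, which is the contrast the study examines.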