This study investigates the performance of three popular Japanese tokenization tools, MeCab, Sudachi, and SentencePiece, when used as a preprocessing step for sentiment-based text classification of Japanese texts. Using Term Frequency-Inverse Document Frequency (TF-IDF) vectorization, we evaluate two traditional machine learning classifiers: Multinomial Naive Bayes and Logistic Regression. The results reveal that Sudachi produces tokens that align closely with dictionary definitions, while MeCab and SentencePiece offer faster processing speeds. In terms of classification performance, the combination of SentencePiece, TF-IDF, and Logistic Regression outperforms the alternatives.
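The pipeline described above (tokenize, vectorize with TF-IDF, classify with Logistic Regression) can be sketched in scikit-learn. This is a minimal illustration, not the study's implementation: the toy sentences and labels are invented, and character n-grams stand in for SentencePiece subword tokens, since training a real SentencePiece model requires a corpus and the `sentencepiece` package.

```python
# Sketch of the TF-IDF + Logistic Regression pipeline from the abstract.
# Assumptions: toy data is hypothetical; character n-grams approximate
# subword tokenization for unsegmented Japanese text in place of SentencePiece.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical Japanese sentiment data (1 = positive, 0 = negative).
texts = [
    "この映画は最高でした",
    "とても面白かった",
    "素晴らしい作品です",
    "この映画はつまらなかった",
    "時間の無駄だった",
    "最悪の作品です",
]
labels = [1, 1, 1, 0, 0, 0]

# Character uni/bigrams serve as a rough proxy for subword units here;
# in the study, the texts would be pre-segmented by MeCab, Sudachi,
# or SentencePiece before TF-IDF weighting.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

pred = model.predict(["面白かった", "つまらなかった"])
```

Swapping in a different tokenizer only changes the vectorizer's input (or its `tokenizer`/`analyzer` argument); the TF-IDF weighting and classifier stay the same, which is what makes the three tools directly comparable in this setup.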