Blog Data Showdown: Machine Learning vs Neuro-Symbolic Models for Gender Classification

Text classification problems, such as gender classification from blog text, form a mature research area that has been studied extensively with machine learning algorithms. Applications include market analysis, customer recommendation, and recommendation systems. This study compares widely used machine learning algorithms, namely Support Vector Machines (SVM), Naive Bayes (NB), Logistic Regression (LR), AdaBoost, XGBoost, and an SVM variant (SVM_R), with neuro-symbolic AI (NeSy). It also examines the effect of text representations such as TF-IDF, the Universal Sentence Encoder (USE), and RoBERTa, and of feature selection and extraction techniques, including Chi-Square, Mutual Information, and Principal Component Analysis. Building on these, we compare the machine learning and deep learning approaches against NeSy. The experimental results show that the NeSy approach matched strong multilayer perceptron (MLP) results despite a limited dataset. Future work will expand the knowledge base, the range of embedding types, and the hyperparameter configurations to further study the effectiveness of the NeSy approach.
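
As a rough illustration of two of the steps named above, TF-IDF weighting and Chi-Square feature scoring, the following is a minimal sketch in plain Python. The toy documents, the placeholder class labels, and the function names (`tfidf`, `chi_square`, `top_terms`) are all assumptions made for illustration; they are not the authors' data or implementation.

```python
import math
from collections import Counter

# Toy, made-up tokenized "blog posts" with placeholder class labels.
docs = [
    ["my", "team", "won", "the", "game"],
    ["the", "game", "was", "great"],
    ["new", "recipe", "for", "the", "weekend"],
    ["weekend", "recipe", "notes"],
]
labels = ["class_a", "class_a", "class_b", "class_b"]


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors


def chi_square(docs, labels, term, positive):
    """Chi-square statistic of term presence vs. membership in `positive`."""
    a = b = c = d = 0  # cells of the 2x2 contingency table
    for doc, label in zip(docs, labels):
        has_term = term in doc
        is_pos = label == positive
        if has_term and is_pos:
            a += 1
        elif has_term:
            b += 1
        elif is_pos:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def top_terms(docs, labels, positive, k):
    """Rank the vocabulary by chi-square score; ties break alphabetically."""
    vocab = {term for doc in docs for term in doc}
    scored = [(chi_square(docs, labels, t, positive), t) for t in vocab]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:k]]


vectors = tfidf(docs)
selected = top_terms(docs, labels, "class_a", 3)
```

In this sketch, a term that appears in only one class (such as `game`) gets a higher Chi-Square score than a term spread across both classes (such as `the`), so it would be kept when the vocabulary is pruned before training a classifier. A real pipeline would normally use a library implementation and a far larger corpus.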
