This paper discusses the impact of the Internet on modern trading and the value of the resulting transaction data for organizations seeking to improve their marketing. It takes Divar, an Iranian online marketplace for buying and selling products and services, as its example and presents a competition to predict the percentage of a car sales ad that would be published on the Divar website. Because the dataset is a rich source of Persian text, the authors use Hazm, a Python library designed for processing Persian text, together with two state-of-the-art language models, mBERT and ParsBERT, to analyze it. The primary objective is to compare the performance of mBERT and ParsBERT on the Divar dataset. The authors give background on data mining, the Persian language, and the two language models; examine the dataset's composition and statistical features; and detail the fine-tuning and training configurations used for both models. They then present their results and highlight the strengths and weaknesses of each model when applied to Persian text. The paper also explains the data mining process, including data cleaning and normalization, and reviews the main types of machine learning problems (supervised, unsupervised, and reinforcement learning) as well as pattern evaluation techniques such as the confusion matrix. Overall, it offers insight into the challenges and opportunities of working with low-resource languages such as Persian, and into the potential of BERT-based language models for analyzing such data.
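As an illustration of the confusion matrix mentioned among the pattern evaluation techniques, the following is a minimal sketch in plain Python. It is not the authors' evaluation code: the label names `published` and `rejected`, the sample predictions, and the helper functions `confusion_matrix` and `precision_recall` are all hypothetical and chosen only to show how the counts are tallied and read.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Return a matrix where rows are true labels and columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def precision_recall(matrix, index):
    """Compute precision and recall for the label at position `index`."""
    true_positive = matrix[index][index]
    predicted_total = sum(row[index] for row in matrix)  # column sum
    actual_total = sum(matrix[index])                    # row sum
    precision = true_positive / predicted_total if predicted_total else 0.0
    recall = true_positive / actual_total if actual_total else 0.0
    return precision, recall


# Hypothetical labels and predictions, for illustration only.
labels = ["published", "rejected"]
y_true = ["published", "published", "rejected", "rejected", "published"]
y_pred = ["published", "rejected", "rejected", "published", "published"]

matrix = confusion_matrix(y_true, y_pred, labels)
precision, recall = precision_recall(matrix, labels.index("published"))

for label, row in zip(labels, matrix):
    print(label, row)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Each row of the matrix corresponds to a true label and each column to a predicted label, so the diagonal holds the correct predictions. The same table could be used to compare two classifiers on a shared test set, assuming the task is framed as classification.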