Systems and individuals produce data continuously. On the Internet, people share knowledge, sentiments, and opinions, and review services and products. Learning automatically from this textual data can give organizations and institutions useful insights, for example by helping them avoid financial losses. To learn from textual data over time, a machine learning system must account for concept drift. Concept drift is a frequent phenomenon in real-world datasets and corresponds to changes in the data distribution over time. For instance, concept drift occurs when sentiments change or when a word's meaning shifts over time. Although concept drift is frequent in real-world applications, benchmark datasets with labeled drifts are rare in the literature. To bridge this gap, this paper provides four textual drift generation methods that ease the production of datasets with labeled drifts. These methods were applied to Yelp and Airbnb datasets, and the resulting streams were used to test incremental classifiers under the stream mining paradigm, evaluating the classifiers' ability to recover from the drifts. Results show that the performance of all classifiers degrades right after the drifts, and that the incremental SVM is both the fastest to run and the fastest to recover its previous performance level in terms of accuracy and Macro F1-Score.
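As a rough illustration of the evaluation setup summarized above, the sketch below injects a labeled drift into a toy text stream and scores an incremental classifier in test-then-train order. Everything in it is an assumption made for illustration: the label-swap drift is a generic stand-in and not necessarily one of the paper's four methods, the eight-word vocabulary replaces the Yelp and Airbnb reviews, and the names `make_stream`, `HingeSGD`, and `prequential` are hypothetical. The classifier is a minimal hinge-loss SGD learner, meant only to stand in for an incremental SVM.

```python
from itertools import combinations

# Toy vocabulary (assumption): each document is a pair of words from one class.
POSITIVE = ["great", "friendly", "clean", "delicious"]
NEGATIVE = ["dirty", "rude", "slow", "bland"]
PAIRS_POS = list(combinations(POSITIVE, 2))
PAIRS_NEG = list(combinations(NEGATIVE, 2))


def make_stream(n, drift_at):
    """Yield (tokens, label, drifted); labels are swapped from index drift_at on."""
    for i in range(n):
        label = i % 2  # alternate classes deterministically
        pairs = PAIRS_POS if label == 1 else PAIRS_NEG
        tokens = pairs[(i // 2) % len(pairs)]
        drifted = i >= drift_at
        # Label-swap drift: the same words now map to the opposite class.
        yield tokens, (1 - label if drifted else label), drifted


class HingeSGD:
    """Minimal online linear classifier updated with hinge-loss SGD steps."""

    def __init__(self, lr=0.5):
        self.w = {}
        self.b = 0.0
        self.lr = lr

    def score(self, tokens):
        return self.b + sum(self.w.get(t, 0.0) for t in tokens)

    def predict(self, tokens):
        return 1 if self.score(tokens) > 0 else 0

    def learn(self, tokens, label):
        y = 1 if label == 1 else -1
        if y * self.score(tokens) < 1:  # margin violated: take one SGD step
            for t in tokens:
                self.w[t] = self.w.get(t, 0.0) + self.lr * y
            self.b += self.lr * y


def prequential(stream, model, window=50):
    """Test-then-train: predict each item first, then learn from it.

    Returns the accuracy of each consecutive window of `window` items.
    """
    accs, hits, count = [], 0, 0
    for tokens, label, _ in stream:
        hits += model.predict(tokens) == label
        model.learn(tokens, label)
        count += 1
        if count == window:
            accs.append(hits / window)
            hits, count = 0, 0
    return accs


# 400 items with a single labeled drift at position 200.
accs = prequential(make_stream(400, drift_at=200), HingeSGD())
```

In this toy run the windowed accuracy dips in the window that starts at the drift position and then returns to its earlier level, which is the degrade-then-recover pattern the abstract reports. Because the drift position is known by construction, recovery can be measured directly against it.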