A Text Classification Framework for Simple and Effective Early Depression Detection Over Social Media Streams

Source: arXiv. Highlights:
(*) A novel text classifier having the ability to visually explain its rationale
(*) Domain-independent classification that does not require feature engineering
(*) Support for incremental learning and text classification over streams
(*) Efficient framework for addressing early risk detection problems
(*) State-of-the-art performance on early depression detection task

With the rise of the Internet, there is a growing need to build intelligent systems that are capable of efficiently dealing with early risk detection (ERD) problems on social media, such as early depression detection, early rumor detection or identification of sexual predators. These systems, nowadays mostly based on machine learning techniques, must be able to deal with data streams since users provide their data over time. In addition, these systems must be able to decide when the processed data is sufficient to actually classify users. Moreover, since ERD tasks involve risky decisions by which people's lives could be affected, such systems must also be able to justify their decisions. However, most standard and state-of-the-art supervised machine learning models are not well suited to deal with this scenario. This is due to the fact that they either act as black boxes or do not support incremental classification/learning. In this paper we introduce SS3, a novel supervised learning model for text classification that naturally supports these aspects. SS3 was designed to be used as a general framework to deal with ERD problems. We evaluated our model on the CLEF's eRisk2017 pilot task on early depression detection. Most of the 30 contributions submitted to this competition used state-of-the-art methods. Experimental results show that our classifier was able to outperform these models and standard classifiers, despite being less computationally expensive and having the ability to explain its rationale.
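
The three requirements named in the abstract (processing a stream incrementally, deciding when enough data has been seen, and justifying the decision) can be illustrated with a small generic sketch. This is not the SS3 algorithm, which the abstract does not specify. The word weights, the class names `at_risk` and `control`, the `margin` threshold, and the function `classify_stream` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical per-word evidence weights (assumption: a real model would
# learn these from labeled data; they are hard-coded here for illustration).
WORD_WEIGHTS: Dict[str, Dict[str, float]] = {
    "hopeless": {"at_risk": 1.0},
    "insomnia": {"at_risk": 0.5},
    "football": {"control": 0.5},
    "holiday": {"control": 0.25},
}


def classify_stream(
    posts: Iterable[str],
    margin: float = 1.0,
    positive: str = "at_risk",
    negative: str = "control",
) -> Tuple[Optional[str], int, Dict[str, float]]:
    """Read posts one at a time and stop as soon as the evidence suffices.

    Returns (label, posts_read, evidence). The label is None when the stream
    ends before the positive class leads the negative class by `margin`.
    The evidence dict maps each contributing word to its accumulated weight
    for the positive class, which serves as a simple explanation.
    """
    totals: Dict[str, float] = defaultdict(float)    # running score per class
    evidence: Dict[str, float] = defaultdict(float)  # word -> positive weight
    posts_read = 0
    for posts_read, post in enumerate(posts, start=1):
        # Incremental update: only the new post is processed.
        for word in post.lower().split():
            for label, weight in WORD_WEIGHTS.get(word, {}).items():
                totals[label] += weight
                if label == positive:
                    evidence[word] += weight
        # Early decision: stop once the positive class leads by the margin.
        if totals[positive] - totals[negative] >= margin:
            return positive, posts_read, dict(evidence)
    return None, posts_read, dict(evidence)


if __name__ == "__main__":
    stream = [
        "went to a football match",
        "feeling hopeless again",
        "insomnia all week",
        "this post is never read",
    ]
    print(classify_stream(stream))
```

In this sketch the classifier commits to a decision after the third post and leaves the fourth unread. The returned word weights show which terms drove the decision. Flagging only the positive class early is a design assumption of the sketch, not something stated in the abstract.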
