Depression detection from user-generated content on the internet has been a long-standing topic of interest in the research community, providing valuable screening tools for psychologists. The ubiquity of social media platforms makes them a natural avenue for exploring how mental health manifests in users' posts and in their interactions with other users. Current methods for depression detection from social media focus mainly on text processing, and only a few also use the images posted by users. In this work, we propose a flexible time-enriched multimodal transformer architecture for detecting depression from social media posts, using pretrained models to extract image and text embeddings. Our model operates directly at the user level, and we enrich it with the relative time between posts by using time2vec positional embeddings. Moreover, we propose a second model variant that operates on randomly sampled, unordered sets of posts, which makes it more robust to dataset noise. We show that our method, using EmoBERTa and CLIP embeddings, surpasses other methods on two multimodal datasets, obtaining state-of-the-art results of 0.931 F1 score on a popular multimodal Twitter dataset and 0.902 F1 score on the only multimodal Reddit dataset.
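To make the time-enrichment idea concrete, the following is a minimal sketch of a time2vec-style encoding in plain Python: one linear component plus sinusoidal components of a scalar time value. It is an illustration under stated assumptions, not the model described above. The names `make_time2vec`, `relative_days`, and `enrich` are hypothetical. The frequencies and phases are drawn at random here, whereas in a trained model they would be learned. Measuring relative time as days elapsed since a user's first post, and adding the time vector to the post embedding, are likewise assumptions made for the example.

```python
import math
import random


def make_time2vec(dim, seed=0):
    """Return an encoder mapping a scalar time to a `dim`-sized vector.

    Component 0 is linear in time; components 1..dim-1 are sinusoidal.
    Assumption: parameters are random here; a real model would learn them.
    """
    rng = random.Random(seed)
    omega = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # frequencies
    phi = [rng.gauss(0.0, 1.0) for _ in range(dim)]    # phase shifts

    def encode(tau):
        linear = omega[0] * tau + phi[0]
        periodic = [math.sin(omega[i] * tau + phi[i]) for i in range(1, dim)]
        return [linear] + periodic

    return encode


def relative_days(timestamps):
    """Convert Unix timestamps (seconds) to days since the first post.

    Assumption: this is one possible definition of "relative time".
    """
    first = timestamps[0]
    return [(t - first) / 86400.0 for t in timestamps]


def enrich(post_embeddings, timestamps, encode):
    """Add the time encoding to each post embedding, element-wise.

    Assumption: additive combination, as with standard positional encodings.
    """
    times = relative_days(timestamps)
    return [
        [e + p for e, p in zip(emb, encode(tau))]
        for emb, tau in zip(post_embeddings, times)
    ]
```

For example, with `encode = make_time2vec(4)`, calling `enrich([[0.0] * 4, [0.0] * 4], [0, 86400], encode)` returns one four-dimensional vector per post. Because the encoding depends only on each post's own timestamp, posts can be shuffled or subsampled without recomputing it.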