We present a large-scale, longitudinal dataset capturing user activity on the online platform of DerStandard, a major Austrian newspaper. The dataset spans ten years (2013-2022) and includes over 75 million user comments, more than 400 million votes, and detailed metadata on articles and user interactions. It provides structured conversation threads, explicit upvotes and downvotes on user comments, and editorial topic labels, enabling rich analyses of online discourse while preserving user privacy. To protect privacy, all persistent identifiers are anonymized using salted hash functions, and the raw comment texts are not publicly shared. Instead, we release pre-computed vector representations of the comments, derived from a state-of-the-art embedding model. The dataset supports research on discussion dynamics, network structures, and semantic analyses in German, a mid-resource language, offering a reusable resource for computational social science and related fields.
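As an illustration of the anonymization step mentioned above, the following is a minimal sketch of salted hashing of persistent identifiers. It assumes SHA-256 and a single secret salt that is never published; the function name `pseudonymize`, the salt handling, and the example identifiers are hypothetical, and the actual hash function and procedure used for the dataset may differ.

```python
import hashlib
import secrets


def pseudonymize(identifier: str, salt: bytes) -> str:
    """Map a persistent identifier to a hex digest using a salted SHA-256 hash.

    Assumption: one secret salt is shared across the whole dataset so that the
    same identifier always maps to the same pseudonym (needed to follow a user
    across comments and votes), while the salt itself stays private.
    """
    return hashlib.sha256(salt + identifier.encode("utf-8")).hexdigest()


# Hypothetical usage: generate the salt once, keep it private, reuse it everywhere.
salt = secrets.token_bytes(32)
user_pseudonym = pseudonymize("user_12345", salt)
same_user_again = pseudonymize("user_12345", salt)
other_user = pseudonymize("user_67890", salt)
```

Because the salt is secret, an outsider cannot rebuild the mapping by hashing candidate identifiers, yet records belonging to the same user remain linkable within the released data.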