A Decade of News Forum Interactions: Threaded Conversations, Signed Votes, and Topical Tags

We present a large-scale, longitudinal dataset capturing user activity on the online platform of DerStandard, a major Austrian newspaper. The dataset spans ten years (2013-2022) and includes over 75 million user comments, more than 400 million votes, and detailed metadata on articles and user interactions. It provides structured conversation threads, explicit upvotes and downvotes on user comments, and editorial topic labels, enabling rich analyses of online discourse while preserving user privacy. To protect privacy, all persistent identifiers are anonymized using salted hash functions, and the raw comment texts are not publicly shared. Instead, we release pre-computed vector representations derived from a state-of-the-art embedding model. The dataset supports research on discussion dynamics, network structures, and semantic analyses in German, a mid-resourced language, offering a reusable resource for computational social science and related fields.
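To illustrate the anonymization step mentioned above, the following is a minimal sketch of salted hashing for persistent identifiers. The abstract does not specify the hash function, the salt length, or how the salt is stored, so SHA-256, the 32-byte salt, and the function name `pseudonymize` are all assumptions made for illustration only.

```python
import hashlib
import secrets


def pseudonymize(identifier: str, salt: bytes) -> str:
    """Return a stable pseudonym for a persistent identifier.

    Assumption: SHA-256 over salt + identifier. The dataset's actual
    hash function and salt handling are not stated in the abstract.
    """
    digest = hashlib.sha256(salt + identifier.encode("utf-8"))
    return digest.hexdigest()


# Hypothetical secret salt, generated once and kept private.
# Reusing the same salt keeps pseudonyms consistent across tables,
# so comments and votes by one user can still be linked.
salt = secrets.token_bytes(32)

user_pseudonym = pseudonymize("user_12345", salt)
same_user_again = pseudonymize("user_12345", salt)
other_user = pseudonymize("user_67890", salt)
```

Because the salt is secret, the published pseudonyms cannot be reproduced by hashing guessed identifiers, while identical inputs still map to identical outputs within the dataset.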

