We present HARPT, a large-scale annotated corpus of mobile health app store reviews aimed at advancing research in user privacy and trust. The dataset comprises over 480,000 user reviews labeled into seven categories that capture critical aspects of trust in applications, trust in providers, and privacy concerns. Creating HARPT required addressing several challenges: defining a nuanced label schema, isolating relevant content from large volumes of noisy data, and designing an annotation strategy that balanced scalability with accuracy. This strategy integrated rule-based filtering, iterative manual labeling with review, targeted data augmentation, and weak supervision using transformer-based classifiers to accelerate label coverage. In parallel, a carefully curated subset of 7,000 reviews was manually annotated to support model development and evaluation. We benchmark a broad range of classification models, demonstrating that strong performance is achievable and providing a baseline for future research. HARPT is released as a public resource to support work in health informatics, cybersecurity, and natural language processing.
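To illustrate how rule-based filtering and weak supervision can be combined, the following is a minimal sketch and not the HARPT pipeline itself. The keyword list, the label names, the confidence threshold of 0.9, and the helpers `is_relevant`, `weak_label`, and `toy_classifier` are all assumptions made for illustration; `toy_classifier` is only a stand-in for a fine-tuned transformer-based classifier.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical keyword list; the actual HARPT filtering rules are not given here.
KEYWORDS = ("privacy", "trust", "data", "secure", "permission")

# A classifier maps a review to (label, confidence in [0, 1]).
Classifier = Callable[[str], Tuple[str, float]]


def is_relevant(review: str, keywords: Iterable[str] = KEYWORDS) -> bool:
    """Rule-based filter: keep a review if it mentions any keyword."""
    text = review.lower()
    return any(keyword in text for keyword in keywords)


def weak_label(
    reviews: Iterable[str],
    classify: Classifier,
    threshold: float = 0.9,  # assumed cut-off, not taken from the paper
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split relevant reviews into auto-labeled and deferred sets.

    Reviews that fail the keyword filter are dropped. Predictions at or
    above the threshold are accepted as weak labels; the remaining
    reviews are deferred, for example to manual annotation.
    """
    accepted: List[Tuple[str, str]] = []
    deferred: List[str] = []
    for review in reviews:
        if not is_relevant(review):
            continue
        label, confidence = classify(review)
        if confidence >= threshold:
            accepted.append((review, label))
        else:
            deferred.append(review)
    return accepted, deferred


def toy_classifier(review: str) -> Tuple[str, float]:
    """Stand-in for a fine-tuned transformer; labels are placeholders."""
    if "privacy" in review.lower():
        return "privacy_concern", 0.95
    return "other", 0.5


if __name__ == "__main__":
    sample = [
        "Great app!",
        "I worry about my privacy here",
        "They share my data",
    ]
    auto, manual = weak_label(sample, toy_classifier)
    print(auto)
    print(manual)
```

In this sketch the threshold trades coverage against label noise: raising it sends more reviews to manual annotation, while lowering it accepts more machine-generated labels.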