Conversational music recommendation (CMR) research currently faces a tradeoff between authentic dialogue corpora that are limited in scale and synthesized corpora that scale up but whose conversations are artificially constructed rather than naturally observed. In this paper, we introduce Reddit2Deezer, a reality-grounded CMR resource derived from 190k unique {thread, leaf-comment} pairs. We release the resource in two versions: a raw version that preserves authenticity, and a paraphrased version that maximizes long-term reproducibility. Each musical entity is linked to a Deezer identifier, which provides straightforward access to audio previews and rich metadata (e.g., genre tags, popularity, BPM), opening the door to future research on content-grounded conversational recommendation. Human validation confirms the quality of the dialogues, item grounding, and paraphrases. The dataset is available at https://huggingface.co/datasets/McAuley-Lab/Reddit2Deezer.