We present "Testimole-conversational", a massive collection of discussion board messages in the Italian language. The large size of the corpus, more than 30B word-tokens spanning 1996-2024, makes it an ideal dataset for pre-training native Italian Large Language Models. Furthermore, discussion board messages are a relevant resource for linguistic as well as sociological analysis. The corpus captures a rich variety of computer-mediated communication, offering insights into informal written Italian, discourse dynamics, and online social interaction over a wide time span. Beyond its relevance for NLP applications such as language modelling, domain adaptation, and conversational analysis, it also supports investigations of language variation and social phenomena in digital communication. The resource will be made freely available to the research community.