We introduce the Emergent Language Corpus Collection (ELCC): a collection of corpora drawn from open source implementations of emergent communication systems across the literature. These systems include a variety of signalling game environments as well as more complex tasks such as a social deduction game and embodied navigation. Each corpus is annotated with metadata describing the characteristics of the source system as well as a suite of analyses of the corpus (e.g., size, entropy, average message length). Currently, research studying emergent languages requires running each system directly, which takes time away from actual analysis of the languages, limits the variety of languages that are studied, and presents a barrier to entry for researchers without a background in deep learning. The availability of a substantial collection of well-documented emergent language corpora will therefore enable new directions of research that focus on the properties of emergent languages themselves rather than on experimental apparatus.
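To make the corpus-level analyses named in the abstract (size, entropy, average message length) concrete, the following is a minimal sketch of how such statistics could be computed. It assumes a corpus is a list of messages, each a list of integer token IDs, and it interprets "entropy" as unigram token entropy in bits. Both the data layout and the function name `corpus_stats` are illustrative assumptions, not ELCC's actual format or analysis code.

```python
import math
from collections import Counter


def corpus_stats(corpus):
    """Compute simple statistics for a corpus of messages.

    `corpus` is assumed (hypothetically) to be a list of messages,
    where each message is a list of integer token IDs.
    """
    # Count how often each token occurs across all messages.
    token_counts = Counter(tok for msg in corpus for tok in msg)
    num_tokens = sum(token_counts.values())
    num_messages = len(corpus)

    # Unigram entropy in bits: -sum p(t) * log2 p(t) over observed tokens.
    entropy = 0.0
    for count in token_counts.values():
        p = count / num_tokens
        entropy -= p * math.log2(p)

    return {
        "num_messages": num_messages,
        "num_tokens": num_tokens,
        "num_unique_tokens": len(token_counts),
        "unigram_entropy_bits": entropy,
        "mean_message_length": num_tokens / num_messages if num_messages else 0.0,
    }


# A toy corpus invented for illustration; not data from ELCC.
toy_corpus = [[0, 1], [1, 0], [0, 0, 1, 1]]
stats = corpus_stats(toy_corpus)
print(stats)
```

With precomputed corpora, statistics of this kind can be obtained without running the emergent communication systems that produced the messages, which is the workflow the abstract argues for.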