We introduce TACO, an open-source, large-scale code generation dataset focused on algorithmic topics, designed to provide a more challenging training dataset and evaluation benchmark for code generation models. TACO consists of competition-level programming problems intended to improve or evaluate problem understanding and reasoning abilities in real-world programming scenarios. The training and test sets contain 25,433 and 1,000 coding problems, respectively, along with up to 1.55 million diverse solutions. Moreover, each TACO problem carries several fine-grained labels, such as task topics, algorithms, programming skills, and difficulty levels, providing a more precise reference for training and evaluating code generation models. The dataset and evaluation scripts are available on Hugging Face Hub (https://huggingface.co/datasets/BAAI/TACO) and GitHub (https://github.com/FlagOpen/TACO).
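As a rough illustration of how the fine-grained labels could be used to select a subset of problems, the sketch below filters records by difficulty and programming skill. The field names `difficulty` and `skill_types`, the label values, the sample records, and the helper names `as_label_list` and `filter_problems` are all assumptions made for this example and are not confirmed by the text above; check the dataset card for the actual schema.

```python
import ast


def as_label_list(value):
    """Normalize a label field to a list of strings.

    Accepts a real list/tuple, or a string holding a Python-style list
    literal (an assumption about how labels might be serialized).
    """
    if isinstance(value, (list, tuple)):
        return [str(v) for v in value]
    if isinstance(value, str) and value.strip().startswith("["):
        try:
            return [str(v) for v in ast.literal_eval(value)]
        except (ValueError, SyntaxError):
            return []
    return [] if value in (None, "") else [str(value)]


def filter_problems(records, difficulty=None, skill=None):
    """Yield records matching the requested difficulty and/or skill label.

    `difficulty` and `skill_types` are hypothetical field names.
    """
    for rec in records:
        if difficulty is not None and rec.get("difficulty") != difficulty:
            continue
        if skill is not None and skill not in as_label_list(rec.get("skill_types")):
            continue
        yield rec


# Hypothetical in-memory records standing in for dataset rows.
SAMPLE = [
    {"name": "p1", "difficulty": "EASY", "skill_types": ["Sorting"]},
    {"name": "p2", "difficulty": "HARD",
     "skill_types": "['Dynamic programming', 'Sorting']"},
    {"name": "p3", "difficulty": "HARD", "skill_types": []},
]

hard_sorting = [
    r["name"] for r in filter_problems(SAMPLE, difficulty="HARD", skill="Sorting")
]
print(hard_sorting)  # → ['p2']
```

Rows loaded from the Hub, for example with `load_dataset` from the Hugging Face `datasets` library, could be passed in place of `SAMPLE`, provided the field names are adjusted to match the real schema.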