In this technical report, we present three novel datasets of Kotlin code: KStack, KStack-clean, and KExercises. We also describe the results of fine-tuning CodeLlama and DeepSeek models on this data. Additionally, we present a version of the HumanEval benchmark rewritten into Kotlin by human experts, covering both the solutions and the tests. Our results demonstrate that small, high-quality datasets (KStack-clean and KExercises) can significantly improve model performance on code generation tasks, achieving up to a 16-point increase in pass rate on the HumanEval benchmark. Finally, we discuss potential future work on improving language modeling for Kotlin, including the use of static analysis tools in the training process and the introduction of more intricate and realistic benchmarks.