A Tertiary Review of Large Language Model-Based Code Generating Tasks: Trends, Challenges, and Future Directions

Context. Large language models (LLMs) are increasingly applied to code-generating tasks (CGTs) in software engineering. While reported results are promising, the broader effects of these applications and their integration into real-world development remain insufficiently understood, and existing tertiary studies provide little coverage of this area. Objective. This tertiary study consolidates secondary evidence on LLM-based CGTs, synthesizing the publication landscape, effects, scenarios, integration challenges, and future research directions. Method. Following systematic review guidelines, we searched relevant digital libraries, complemented by backward and forward snowballing and a screening step. Study quality was assessed, and extraction reliability was audited with inter-rater agreement statistics. Evidence was synthesized using SWEBOK knowledge areas and the HELM framework. Results. We identify 30 secondary studies published between 2017 and 2025, with rapid growth since 2023. Accuracy appears strong on benchmarks but is weakly supported for real-world generalization; robustness is fragile across tasks and configurations; efficiency constraints are pervasive; toxicity and bias are under-reported. Dominant challenges concern economic feasibility, evaluation validity, and socio-technical integration. Future directions point to domain-aware model improvement and holistic, standardized evaluation. Conclusion. LLM-based CGTs represent a fast-maturing yet unevenly evaluated research area, highlighting the need for domain-aware model improvements, holistic and standardized evaluation, and attention to efficiency and associated costs.
