Developers frequently reuse Stack Overflow code snippets, yet the quality of these snippets remains unevenly understood, particularly across programming languages and geographic contexts. This study investigates code quality in Stack Overflow answers from contributors located in the United States, focusing on SQL, JavaScript, Python, Ruby, and Java snippets. We evaluate four quality dimensions: reliability, readability, performance, and security. Using language-specific linting and static analysis tools, we quantify violations across states and cities, compute violation densities to enable fair regional comparison, and examine relationships between code quality and state-level diversity indicators. We further conduct inductive content analysis on code snippets from California, Utah, and North Dakota to identify qualitative patterns in code quality violations. Results show that readability violations are the most prevalent across all languages, followed by reliability, performance, and security. Common issues include improper whitespace, inconsistent formatting, program-flow errors, inefficient resource use, unsanitised inputs, and insecure dynamic evaluation. Regional analysis indicates that major technology hubs produce more parsable snippets but do not necessarily exhibit higher violation densities. States with broader access to computing devices, Internet subscriptions, higher income, and more equitable wealth distribution tend to show fewer code quality violations. Qualitative findings suggest that established technology regions often produce more complex violations, while less mature technology regions display more fundamental errors. These findings highlight the socio-technical nature of code quality in community question-answering platforms and suggest that developers should exercise caution when reusing online code snippets.
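The violation density mentioned in the abstract can be illustrated with a small sketch. The abstract does not define the measure, so the version below assumes density means violations per 100 lines of code, aggregated per region; the function name `violation_density`, the record layout, and the sample numbers are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def violation_density(snippets):
    """Return violations per 100 lines of code for each region.

    Assumption: density = 100 * total violations / total lines of code.
    The study may normalise differently (e.g. per snippet or per answer).
    """
    violations = defaultdict(int)
    lines = defaultdict(int)
    for region, n_violations, n_lines in snippets:
        violations[region] += n_violations
        lines[region] += n_lines
    # Skip regions with no code lines to avoid division by zero.
    return {
        region: 100.0 * violations[region] / lines[region]
        for region in lines
        if lines[region] > 0
    }


# Made-up illustrative records: (region, violations found, lines of code).
sample = [
    ("California", 12, 300),
    ("California", 8, 200),
    ("North Dakota", 3, 50),
]
densities = violation_density(sample)
```

Normalising by code volume in this way is what lets a region with many snippets be compared against one with few: a raw violation count would rank the larger region as worse simply because it contributes more code.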