On the uses and abuses of regression models: a call for reform of statistical practice and teaching

From arXiv: 24 pages main document including 3 figures, plus 15 pages of supplementary material. Based on a plenary lecture (President's Invited Speaker) delivered to ISCB43, Newcastle, UK, August 2022. Submitted for publication 12-Sep-23.

When students and users of statistical methods first learn about regression analysis there is an emphasis on the technical details of models and estimation methods that invariably runs ahead of the purposes for which these models might be used. More broadly, statistics is widely understood to provide a body of techniques for "modelling data", underpinned by what we describe as the "true model myth", according to which the task of the statistician/data analyst is to build a model that closely approximates the true data generating process. By way of our own historical examples and a brief review of mainstream clinical research journals, we describe how this perspective leads to a range of problems in the application of regression methods, including misguided "adjustment" for covariates, misinterpretation of regression coefficients and the widespread fitting of regression models without a clear purpose. We then outline an alternative approach to the teaching and application of regression methods, which begins by focussing on clear definition of the substantive research question within one of three distinct types: descriptive, predictive, or causal. The simple univariable regression model may be introduced as a tool for description, while the development and application of multivariable regression models should proceed differently according to the type of question. Regression methods will no doubt remain central to statistical practice as they provide a powerful tool for representing variation in a response or outcome variable as a function of "input" variables, but their conceptualisation and usage should follow from the purpose at hand.
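
As an illustration only (it is not taken from the paper), the following minimal sketch shows why the same regression coefficient can answer different questions. It assumes a simulated data set with a single confounder `z`, an exposure `x`, and an outcome `y`; the variable names, effect sizes, and the helper `ols_coefficients` are all hypothetical choices made for this example. By construction, the causal effect of `x` on `y` is 1.0, while the unadjusted (descriptive) slope of `y` on `x` is about 2.0.

```python
import numpy as np


def ols_coefficients(y, *columns):
    """Least-squares fit of y on an intercept plus the given columns.

    Returns the coefficient vector, intercept first.
    (Hypothetical helper written for this illustration.)
    """
    design = np.column_stack([np.ones_like(y), *columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


# Simulated data under assumed values: z confounds the x -> y relationship.
rng = np.random.default_rng(0)
n = 100_000
z = rng.normal(size=n)                       # confounder
x = z + rng.normal(size=n)                   # exposure depends on z
y = 1.0 * x + 2.0 * z + rng.normal(size=n)   # effect of x on y is 1.0 by construction

# Descriptive question: how does the mean of y vary with x in this population?
unadjusted_slope = ols_coefficients(y, x)[1]

# Causal question: what is the effect of x on y, holding the confounder z fixed?
adjusted_slope = ols_coefficients(y, x, z)[1]

print(f"unadjusted slope of y on x:   {unadjusted_slope:.2f}")
print(f"slope of y on x, adjusted z:  {adjusted_slope:.2f}")
```

Neither estimate is wrong in this sketch: the unadjusted slope is a valid description of how `y` varies with `x`, while the adjusted slope recovers the simulated causal effect only because `z` is known here to be the sole confounder. In real data that structure has to be justified by the research question and subject-matter knowledge, not by the fitted model.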
