When is Plasmode simulation superior to parametric simulation when estimating the MSE of the least squares estimator in linear regression?

Simulation is a crucial tool for the evaluation and comparison of statistical methods. How to design fair and neutral simulation studies is therefore of great interest to researchers developing new methods and to practitioners who must choose the most suitable method. The term simulation usually refers to parametric simulation, that is, computer experiments using artificial data made up of pseudo-random numbers. Plasmode simulation, that is, computer experiments that combine resampling feature data from a real-life dataset with generating the target variable from a known user-selected outcome-generating model (OGM), is an alternative that is often claimed to produce more realistic data. We compare parametric and Plasmode simulation for the example of estimating the mean squared error (MSE) of the least squares estimator (LSE) in linear regression. If the true underlying data-generating process (DGP) and the OGM were known, parametric simulation would obviously be the best choice in terms of estimating the MSE well. However, in reality, both are usually unknown, so researchers have to make assumptions: Plasmode simulation requires assumptions about the OGM only, whereas parametric simulation requires assumptions about both the DGP and the OGM. Most likely, these assumptions do not exactly reflect the truth. Here, we aim to find out how assumptions deviating from the true DGP and the true OGM affect the performance of parametric and Plasmode simulations in the context of MSE estimation for the LSE, and which simulation type is preferable in which situations. Our results suggest that the preferable simulation method depends on many factors, including the number of features, and on how and to what extent the assumptions of a parametric simulation differ from the true DGP. The resampling strategy used for Plasmode simulation also influences the results. In particular, subsampling with a small sampling proportion can be recommended.
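
To make the two simulation types concrete, the following is a minimal Python sketch, not the study's actual implementation. All names (`lse`, `ogm`, `mse_parametric`, `mse_plasmode`, `X_real`) are hypothetical. It rests on several assumptions: a synthetic skewed feature matrix stands in for the real-life dataset, the parametric simulation assumes normally distributed features, the OGM is linear with Gaussian noise, and the MSE is taken as the expected squared Euclidean distance between the estimated and the true coefficient vector.

```python
import numpy as np

def lse(X, y):
    """Least squares estimate of (intercept, slopes)."""
    Xd = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return coef

def ogm(X, beta, sigma, rng):
    """Assumed outcome-generating model: linear with Gaussian noise."""
    return beta[0] + X @ beta[1:] + rng.normal(0.0, sigma, size=len(X))

def mse_parametric(n, mean, cov, beta, sigma, reps, rng):
    """Parametric simulation: features drawn from an assumed normal DGP."""
    errs = np.empty(reps)
    for r in range(reps):
        X = rng.multivariate_normal(mean, cov, size=n)
        y = ogm(X, beta, sigma, rng)
        errs[r] = np.sum((lse(X, y) - beta) ** 2)
    return errs.mean()

def mse_plasmode(X_real, n, beta, sigma, reps, rng, replace=False):
    """Plasmode simulation: features resampled from the observed data.

    replace=False gives subsampling of n rows, replace=True a bootstrap.
    """
    errs = np.empty(reps)
    for r in range(reps):
        idx = rng.choice(len(X_real), size=n, replace=replace)
        X = X_real[idx]
        y = ogm(X, beta, sigma, rng)
        errs[r] = np.sum((lse(X, y) - beta) ** 2)
    return errs.mean()

rng = np.random.default_rng(0)

# Stand-in for a real-life dataset (assumption): skewed, non-normal features.
N, p = 500, 3
X_real = rng.lognormal(size=(N, p))

beta = np.array([1.0, 0.5, -2.0, 1.5])  # user-selected OGM coefficients
sigma = 1.0
n = 50  # subsample size, i.e. a sampling proportion of n / N = 0.1

# The parametric simulation assumes normal features with moments taken from
# the data, which deviates from the skewed distribution of X_real.
est_par = mse_parametric(n, X_real.mean(axis=0), np.cov(X_real, rowvar=False),
                         beta, sigma, reps=200, rng=rng)
est_pla = mse_plasmode(X_real, n, beta, sigma, reps=200, rng=rng)

print(f"parametric MSE estimate: {est_par:.4f}")
print(f"Plasmode MSE estimate:   {est_pla:.4f}")
```

In this sketch only the source of the feature matrix differs between the two functions, which mirrors the distinction drawn above: the parametric variant needs an assumption about the feature distribution, while the Plasmode variant takes the features from the data and needs only the OGM.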
