Simulation is a crucial tool for the evaluation and comparison of statistical methods. How to design fair and neutral simulation studies is therefore of great interest to researchers developing new methods and to practitioners who must choose the most suitable method. The term simulation usually refers to parametric simulation, that is, computer experiments using artificial data made up of pseudo-random numbers. Plasmode simulation, that is, computer experiments that combine resampling of feature data from a real-life dataset with generation of the target variable from a known, user-selected outcome-generating model (OGM), is an alternative that is often claimed to produce more realistic data. We compare parametric and Plasmode simulation using the example of estimating the mean squared error (MSE) of the least squares estimator (LSE) in linear regression. If the true underlying data-generating process (DGP) and the OGM were known, parametric simulation would obviously be the best choice for estimating the MSE well. In reality, however, both are usually unknown, so researchers have to make assumptions: in Plasmode simulation about the OGM, in parametric simulation about both the DGP and the OGM. Most likely, these assumptions do not exactly reflect the truth. Here, we investigate how assumptions that deviate from the true DGP and the true OGM affect the performance of parametric and Plasmode simulations in the context of MSE estimation for the LSE, and which simulation type is preferable in which situations. Our results suggest that the preferable simulation method depends on many factors, including the number of features and how and to what extent the assumptions of a parametric simulation differ from the true DGP. The resampling strategy used for Plasmode simulation also influences the results. In particular, subsampling with a small sampling proportion can be recommended.
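The two simulation types compared above can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual setup: the function names (`lse`, `ogm`, `mse_parametric`, `mse_plasmode`), the standard normal DGP, the linear Gaussian OGM without intercept, the coefficient vector, and the stand-in "real" feature matrix are all assumptions made for the example. The MSE is taken here as the average squared Euclidean distance between the estimated and the true coefficient vector.

```python
import numpy as np


def lse(X, y):
    """Least squares estimate of the coefficients (no intercept, by assumption)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def ogm(X, beta, sigma, rng):
    """Assumed outcome-generating model: linear in X with Gaussian noise."""
    return X @ beta + rng.normal(0.0, sigma, size=X.shape[0])


def mse_parametric(n, beta, sigma, reps, rng):
    """Parametric simulation: features drawn from an assumed DGP (i.i.d. N(0, 1))."""
    p = beta.shape[0]
    errs = np.empty(reps)
    for r in range(reps):
        X = rng.normal(size=(n, p))          # assumed DGP for the features
        y = ogm(X, beta, sigma, rng)         # assumed OGM for the target
        errs[r] = np.sum((lse(X, y) - beta) ** 2)
    return errs.mean()


def mse_plasmode(X_real, prop, beta, sigma, reps, rng):
    """Plasmode simulation: subsample real feature rows, then apply the OGM."""
    n_total = X_real.shape[0]
    n = int(round(prop * n_total))           # subsample size from sampling proportion
    errs = np.empty(reps)
    for r in range(reps):
        idx = rng.choice(n_total, size=n, replace=False)  # subsampling, no replacement
        X = X_real[idx]
        y = ogm(X, beta, sigma, rng)         # only the OGM has to be assumed
        errs[r] = np.sum((lse(X, y) - beta) ** 2)
    return errs.mean()


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    beta = np.array([1.0, -2.0, 0.5])        # hypothetical true coefficients
    # Stand-in for a real-life dataset: heavy-tailed features (illustrative only).
    X_real = rng.standard_t(df=5, size=(500, 3))
    print("parametric:", mse_parametric(50, beta, 1.0, 500, rng))
    print("plasmode:  ", mse_plasmode(X_real, 0.1, beta, 1.0, 500, rng))
```

In this sketch the parametric variant must assume a feature distribution, whereas the Plasmode variant takes the features from the dataset and only assumes the OGM. The `prop` argument corresponds to the sampling proportion of the subsampling strategy. Here the parametric features deliberately differ from the stand-in data, so the two MSE estimates need not agree.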