Logs with zeros? Some problems and solutions

When studying an outcome $Y$ that is weakly positive but can equal zero (e.g. earnings), researchers frequently estimate an average treatment effect (ATE) for a "log-like" transformation that behaves like $\log(Y)$ for large $Y$ but is defined at zero (e.g. $\log(1+Y)$, $\mathrm{arcsinh}(Y)$). We argue that ATEs for log-like transformations should not be interpreted as approximating percentage effects, since unlike a percentage, they depend on the units of the outcome. In fact, we show that if the treatment affects the extensive margin, one can obtain a treatment effect of any magnitude simply by rescaling the units of $Y$ before taking the log-like transformation. This arbitrary unit-dependence arises because an individual-level percentage effect is not well-defined for individuals whose outcome changes from zero to non-zero when receiving treatment, and the units of the outcome implicitly determine how much weight the ATE for a log-like transformation places on the extensive margin. We further establish a trilemma: when the outcome can equal zero, there is no treatment effect parameter that is an average of individual-level treatment effects, unit-invariant, and point-identified. We discuss several alternative approaches that may be sensible in settings with an intensive and extensive margin, including (i) expressing the ATE in levels as a percentage (e.g. using Poisson regression), (ii) explicitly calibrating the value placed on the intensive and extensive margins, and (iii) estimating separate effects for the two margins (e.g. using Lee bounds). We illustrate these approaches in three empirical applications.
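The unit-dependence described above can be seen in a small numerical sketch. The data below are hypothetical and not taken from the paper: treatment is assumed to move 20% of individuals from an outcome of zero to an outcome of 100, so the whole effect runs through the extensive margin. The helper names `log_like_ate` and `percent_effect_in_levels` are illustrative only, and the ratio of means stands in for approach (i) without fitting an actual Poisson regression.

```python
import numpy as np


def log_like_ate(y_treated, y_control, scale=1.0, transform=np.log1p):
    """Difference in means of transform(scale * Y) between the two groups.

    `scale` re-expresses the outcome in different units (e.g. dollars vs. cents).
    """
    return transform(scale * y_treated).mean() - transform(scale * y_control).mean()


def percent_effect_in_levels(y_treated, y_control):
    """ATE in levels expressed as a share of the control mean: E[Y1]/E[Y0] - 1."""
    return y_treated.mean() / y_control.mean() - 1.0


# Hypothetical outcomes (an assumption for illustration):
# half of the control group has Y = 0, but only 30% of the treated group does.
y_control = np.array([0.0] * 5 + [100.0] * 5)
y_treated = np.array([0.0] * 3 + [100.0] * 7)

for scale in (0.01, 1.0, 100.0):
    ate_log1p = log_like_ate(y_treated, y_control, scale)
    ate_asinh = log_like_ate(y_treated, y_control, scale, transform=np.arcsinh)
    pct = percent_effect_in_levels(scale * y_treated, scale * y_control)
    print(f"scale={scale:>6}: log(1+Y) ATE={ate_log1p:.3f}, "
          f"arcsinh ATE={ate_asinh:.3f}, levels ratio={pct:.3f}")
```

In this sketch the ATE for both log-like transformations keeps growing as the units shrink (larger `scale`), while the effect in levels expressed as a percentage of the control mean stays at 0.4 regardless of units.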

