Job Market Paper
Robust Caliper Tests
Caliper tests are widely used to test for the presence of p-hacking and publication bias based on the distribution of the z-statistics across studies. We show that without additional restrictions on the distribution of true effects, Caliper tests may suffer from substantial size distortions. We propose a modification of the existing Caliper test, referred to as the Robust Caliper test, which is shown to control size irrespective of the true effect distribution. We also propose a way of correcting the regression-based version of the Caliper test that allows for the inclusion of additional covariates. The proposed tests are easy to implement and perform well in practice.
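For readers unfamiliar with the idea, the following is a minimal sketch of a conventional caliper test, not the Robust Caliper test proposed in the paper. It assumes a significance threshold of 1.96, a caliper half-width of 0.2, and a one-sided exact binomial test of equal probabilities on each side of the threshold; the function name `caliper_test` and both default values are illustrative choices, not taken from the paper.

```python
from math import comb


def caliper_test(z_stats, threshold=1.96, width=0.2):
    """Conventional caliper test (illustrative sketch, assumed defaults).

    Counts |z| just above and just below the threshold and tests
    H0: P(above) = 0.5 against H1: P(above) > 0.5 with an exact binomial test.
    Returns (count_above, count_below, one_sided_p_value).
    """
    above = sum(1 for z in z_stats if threshold < abs(z) <= threshold + width)
    below = sum(1 for z in z_stats if threshold - width <= abs(z) <= threshold)
    n = above + below
    if n == 0:
        # No observations inside the caliper: nothing to test.
        return above, below, 1.0
    # Exact upper-tail probability P(X >= above) for X ~ Binomial(n, 0.5).
    p_value = sum(comb(n, k) for k in range(above, n + 1)) / 2 ** n
    return above, below, p_value


if __name__ == "__main__":
    # Hypothetical z-statistics, for illustration only.
    print(caliper_test([2.0, 2.1, 2.05, 1.9, 0.5, 3.0]))
```

The null of equal probabilities on both sides is exactly the kind of restriction that, according to the abstract, need not hold without assumptions on the distribution of true effects, which is why this naive version can be size-distorted.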
Detecting p-hacking (with Graham Elliott and Kaspar Wüthrich), Econometrica, 2022
We theoretically analyze the problem of testing for p-hacking based on distributions of p-values across multiple studies. We provide general results for when such distributions have testable restrictions (are non-increasing) under the null of no p-hacking. We find novel additional testable restrictions for p-values based on t-tests. Specifically, the shape of the power functions results in both complete monotonicity as well as bounds on the distribution of p-values. These testable restrictions result in more powerful tests for the null hypothesis of no p-hacking. When there is also publication bias, our tests are joint tests for p-hacking and publication bias. A reanalysis of two prominent data sets shows the usefulness of our new tests.
(When) Can We Detect p-hacking? (with Graham Elliott and Kaspar Wüthrich)
p-Hacking can undermine the validity of empirical studies. A flourishing empirical literature investigates the prevalence of p-hacking based on the empirical distribution of reported p-values across studies. Interpreting results in this literature requires a careful understanding of the power of methods used to detect different types of p-hacking. We theoretically study the implications of likely forms of p-hacking on the distribution of reported p-values and the power of existing methods for detecting it. Power can be quite low, depending crucially on the particular p-hacking strategy and the distribution of actual effects tested by the studies. We relate the power of the tests to the costs of p-hacking and show that power tends to be larger when p-hacking is very costly. Monte Carlo simulations support our theoretical results.
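A small Monte Carlo sketch can make the power issue concrete. It is not the simulation design from the paper: it assumes one-sided z-tests, a zero true effect, and a researcher who runs several independent specifications and reports the smallest p-value. The function names `simulate_p_values` and `share_below`, and the independence of specifications, are assumptions made for illustration.

```python
import random
from statistics import NormalDist


def simulate_p_values(n_studies, n_specs, effect=0.0, seed=0):
    """Reported p-values when each study keeps the smallest of n_specs p-values.

    Assumption: specifications yield independent z ~ N(effect, 1) and
    p-values come from a one-sided test. n_specs=1 means no p-hacking.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    reported = []
    for _ in range(n_studies):
        p_vals = [1 - std_normal.cdf(rng.gauss(effect, 1.0)) for _ in range(n_specs)]
        reported.append(min(p_vals))
    return reported


def share_below(p_values, cutoff=0.05):
    """Fraction of reported p-values at or below the cutoff."""
    return sum(p <= cutoff for p in p_values) / len(p_values)


if __name__ == "__main__":
    honest = simulate_p_values(2000, n_specs=1)
    hacked = simulate_p_values(2000, n_specs=5)
    print(share_below(honest), share_below(hacked))
```

Under these assumptions the reported p-value is the minimum of K independent uniform draws, with density K(1 - p)^(K - 1). That density is still non-increasing, so a test that only looks for violations of monotonicity would have no power against this strategy even though the share of significant results rises sharply.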