G Hopkins, Institute of Sport, Exercise and Active Living, Victoria University, Melbourne, Australia.
In the article on sample-size estimation (Hopkins, 2006), I asserted that the sample size for adequate precision of the estimate of the standard deviation representing individual responses in a controlled trial was similar to that for the subject characteristics that potentially explain the individual responses. That assertion was incorrect. In this In-brief item I show that the required sample size in the worst-case scenario of zero mean change and zero individual responses is 6.5n², where n is the sample size for adequate precision of the mean. Since n is usually at least 20, planning for adequate precision of the estimate of individual responses is obviously impractical. Instead, researchers should plan for adequate precision of the subject characteristics and mechanism variables that might explain individual responses, since their sample size in the worst-case scenario is "only" 4n (Hopkins, 2006). The standard deviation for individual responses should still be assessed, because the estimate will be clear for sufficiently large values, and in any case it is important to know how large the individual responses might be, as shown by the upper confidence limit.
The magnitude of individual responses is expressed as a standard deviation, SDIR (e.g., ±2.6% around the treatment's mean effect of 1.8%). The sampling variance (standard error squared) of SDIR² is given by statistical first principles as 2V²/DF, where V = SDIR² and DF is the degrees of freedom of the SDIR. V is the difference in the variances of the change scores in the experimental and control groups; hence the sampling variance of SDIR² is 2SDDE⁴/(nIR-1) + 2SDDC⁴/(nIR-1), where SDDE and SDDC are the standard deviations of change scores in the experimental and control groups, and nIR is the sample size required in each group (assumed equal) to give adequate precision to SDIR. The square root of this expression is the sampling standard error of SDIR². In the worst-case scenario, SDIR = 0, so SDDE = SDDC = SDD, and the sampling standard error of SDIR² is 2SDD²/√(nIR-1). The sampling standard error of SDIR is not exactly equal to the square root of this expression. In a simple simulation of a normally distributed variance with mean zero, the expected sampling standard error of the square root of the variance is ~0.80 of the square root of the sampling variance of the variance. Hence the sampling standard error of SDIR is 0.80√[2SDD²/√(nIR-1)]. Since nIR turns out to be very much greater than 1, it follows that the uncertainty in SDIR is inversely proportional to the fourth root of the sample size, whereas the uncertainty in mean effects is inversely proportional only to the square root.
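The first-principles result invoked above, that the sampling variance of a sample variance is 2V²/DF, can be checked with a short simulation (this sketch is mine, not part of the original derivation; the sample size and number of replicates are arbitrary choices):

```python
# Simulation check of the first-principles result used in the text:
# the sampling variance of a sample variance of normally distributed data
# is 2*V**2/DF, where V is the true variance and DF = n - 1.
import numpy as np

rng = np.random.default_rng(1)
n = 10            # observations per simulated study
reps = 200_000    # number of simulated studies
true_sd = 1.0

# Unbiased variance estimate from each simulated study
sample_vars = rng.normal(0.0, true_sd, size=(reps, n)).var(axis=1, ddof=1)

empirical = sample_vars.var(ddof=1)       # observed sampling variance of the variance
theoretical = 2 * true_sd**4 / (n - 1)    # 2*V**2/DF

print(f"empirical {empirical:.4f} vs theoretical {theoretical:.4f}")
```

With these settings the empirical value agrees with 2V²/DF to within simulation error of a percent or so.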
Now, the smallest important value of a standard deviation is half that of a difference or change in a mean (Smith and Hopkins, 2011; further justification provided below). Evidence that this rule applies to SDIR is provided by considering how the proportions of positive, trivial, and negative responders change as SDIR increases for a given mean effect of the treatment (Table 1).
These proportions were derived with a spreadsheet that can also be used to investigate how they are impacted by uncertainty in the SDIR. On the reasonable assumption that a difference of 10% in the proportion of responders is substantial, an SDIR of 0.5× the smallest important mean change produces a substantial difference in proportions of responders when the mean change is trivial (0.5× the smallest important change), and an SDIR of 1.0× produces substantial differences in proportions when the mean change is zero or trivial. Larger values of SDIR are needed for substantial changes in proportions when changes in the mean are substantial. Thus 0.5× the smallest important mean change is an appropriate smallest important value for SDIR in the worst-case scenario of trivial changes in the mean.
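The spreadsheet calculation behind such proportions can be sketched as follows, assuming individual responses are normally distributed around the mean treatment effect (the function name and the example values, in units of the smallest important change, are my illustrative choices, not taken from the spreadsheet):

```python
# Proportions of positive, trivial, and negative responders when individual
# responses are normally distributed around the treatment's mean effect.
from math import erf, sqrt

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def responder_proportions(mean_effect, sd_ir, smallest_important=1.0):
    """Return (positive, trivial, negative) proportions for a given mean
    effect and SDIR, all in units of the smallest important change."""
    if sd_ir == 0:
        pos = 1.0 if mean_effect > smallest_important else 0.0
        neg = 1.0 if mean_effect < -smallest_important else 0.0
    else:
        pos = 1.0 - phi((smallest_important - mean_effect) / sd_ir)
        neg = phi((-smallest_important - mean_effect) / sd_ir)
    return pos, 1.0 - pos - neg, neg

# Trivial mean change (0.5x smallest important) with SDIR = 0.5x:
pos, triv, neg = responder_proportions(mean_effect=0.5, sd_ir=0.5)
print(f"positive {pos:.1%}, trivial {triv:.1%}, negative {neg:.1%}")
```

For this trivial mean change, the proportion of positive responders rises from 0% at SDIR = 0 to about 16% at SDIR = 0.5×, a substantial difference by the 10% criterion above.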
The standard error for SDIR therefore needs to be 0.5 of the standard error for the difference in the mean changes, when the sample size for the change in the mean (nD) gives adequate precision for zero change in the mean. The standard error for the change in the mean in each group is SDD/√nD, and the standard error for the difference in the changes is √2SDD/√nD. So 0.80√[2SDD²/√(nIR-1)] = 0.5√2SDD/√nD, from which it follows that nIR = 1 + (0.80/0.5)⁴nD² ≈ 6.5nD². I have used simulations published in this issue of Sportscience to check that this formula is valid (Hopkins, 2018).
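The final sample-size relationship can be written as a small function (the 0.80 factor and the 0.5 scaling are taken directly from the text; the function name is mine):

```python
# Per-group sample size for adequate precision of SDIR in the worst-case
# scenario, given n_d, the sample size for adequate precision of the mean.
def sample_size_individual_responses(n_d, factor=0.80):
    """n_IR = 1 + (0.80/0.5)**4 * n_d**2, i.e. approximately 6.5 * n_d**2."""
    return 1 + (factor / 0.5) ** 4 * n_d ** 2

# For a typical trial of n_d = 20 per group:
print(sample_size_individual_responses(20))   # roughly 6.5 * 400 = 2600
```

Even for a modest nD of 20 per group, the worst-case requirement for estimating SDIR is over 2600 per group, which illustrates why the article recommends planning for the explanatory variables instead.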
This is a very useful contribution to the body of knowledge on treatment heterogeneity. Hopkins has demonstrated that the required sample size for adequate precision of estimation of the SD for individual responses (in the worst-case scenario) is infeasibly large, and no such trial could ever be conducted. For example, consider a conventional parallel-group, before-and-after RCT planned with 90% power at 2-tailed P=0.05 to detect a difference of 3 mmHg in systolic blood pressure with an SD of 10 mmHg, with a correlation between baseline and follow-up measures over the time course of the experiment of r=0.7. Such a study, based on an ANCOVA analysis model to adjust for chance baseline imbalance, would require 120 participants in each arm. Estimating the individual-response variance with adequate precision would require up to 93,600 participants per group!
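The commentator's figures can be reproduced with the standard normal-approximation sample-size formula, in which the ANCOVA adjustment for baseline enters as a residual SD of SD√(1-r²) (this is a sketch of the arithmetic under those standard assumptions, not necessarily the exact software the commentator used):

```python
# Reproduce the worked example: 120 per arm for the mean effect,
# then 6.5 * n**2 for the worst-case individual-response estimate.
from math import ceil, sqrt
from statistics import NormalDist

alpha, power = 0.05, 0.90
delta = 3.0    # difference to detect, mmHg
sd = 10.0      # between-subject SD, mmHg
r = 0.7        # baseline-to-follow-up correlation

z = NormalDist().inv_cdf
sd_eff = sd * sqrt(1 - r ** 2)   # residual SD after baseline adjustment
n_per_arm = ceil(2 * (z(1 - alpha / 2) + z(power)) ** 2 * (sd_eff / delta) ** 2)

n_ir = 6.5 * n_per_arm ** 2      # worst-case sample size for SDIR
print(n_per_arm, int(n_ir))      # prints 120 93600
```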
As Hopkins mentions, much smaller and more realistic sample sizes would be needed if the net mean effect (intervention minus control) and the SD for individual responses were substantial. However, he argues persuasively that it is more sensible to design trials with adequate precision to evaluate the effect of putative modifiers of true individual response variance. In this instance the “rule of 4” applies: for any such effect modifier we need 4× the sample size required for the overall net mean effect (480 per arm in the above example). With ever-increasing hype surrounding personalized or precision medicine, we need larger trials and appropriate analysis methods to make robust inferences.