
Sample Size for Individual Responses
Will G Hopkins, Institute of Sport, Exercise and Active Living, Victoria University, Melbourne, Australia.
In the article on sample-size estimation (Hopkins, 2006), I asserted that the sample size for adequate precision of the estimate of the standard deviation representing individual responses in a controlled trial was similar to that for the subject characteristics that potentially explain the individual responses. That assertion was incorrect. In this In-brief item I show that the required sample size in the worst-case scenario of zero mean change and zero individual responses is 6.5n^{2}, where n is the sample size for adequate precision of the mean. Since n is usually at least 20, the required sample size is at least 6.5 × 20^{2} = 2600 per group, so planning for adequate precision of the estimate of individual responses is impractical. Instead, researchers should plan for adequate precision of the subject characteristics and mechanism variables that might explain individual responses, since their sample size in the worst-case scenario is "only" 4n (Hopkins, 2006). The standard deviation for individual responses should still be assessed, because the estimate will be clear for sufficiently large values, and in any case it is important to know how large the individual responses might be, as shown by the upper confidence limit.

The magnitude of individual responses is expressed as a standard deviation, SD_{IR} (e.g., ±2.6% around the treatment's mean effect of 1.8%). The sampling variance (standard error squared) of a variance V is given by statistical first principles as 2V^{2}/DF, where DF is the degrees of freedom of the variance. SD_{IR}^{2} is the difference between the variances of the change scores in the experimental and control groups, and the sampling variances of these two independent estimates add; hence the sampling variance of SD_{IR}^{2} is 2SD_{DE}^{4}/(n_{IR}-1) + 2SD_{DC}^{4}/(n_{IR}-1), where SD_{DE} and SD_{DC} are the standard deviations of change scores in the experimental and control groups, and n_{IR} is the sample size required in each group (assumed equal) to give adequate precision to SD_{IR}. The square root of this expression is the sampling standard error of SD_{IR}^{2}.
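As a rough illustration of the expression above, the following sketch computes SD_{IR}^{2} and its sampling standard error from the change-score standard deviations of the two groups. The function name and the numerical inputs are hypothetical and chosen only for illustration; equal group sizes are assumed, as in the text.

```python
import math


def sd_ir_variance_and_se(sd_de, sd_dc, n_per_group):
    """Return (V, SE of V), where V = SD_IR^2 = SD_DE^2 - SD_DC^2.

    sd_de, sd_dc: SDs of change scores in the experimental and control groups.
    n_per_group:  sample size in each group (assumed equal).
    """
    df = n_per_group - 1
    v = sd_de**2 - sd_dc**2
    # Sampling variance of each group's variance is 2*SD^4/DF; the two add.
    sampling_variance = 2 * sd_de**4 / df + 2 * sd_dc**4 / df
    return v, math.sqrt(sampling_variance)


# Hypothetical values, not taken from any study.
v, se_v = sd_ir_variance_and_se(sd_de=3.0, sd_dc=2.0, n_per_group=21)
print(f"SD_IR^2 = {v:.2f}, standard error = {se_v:.2f}")
```

When the two standard deviations are equal, the function returns a variance of zero and a standard error of 2SD_{D}^{2}/√(n-1).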
In the worst-case scenario, SD_{IR} = 0, so SD_{DE} = SD_{DC} = SD_{D}, and the sampling standard error of SD_{IR}^{2} is 2SD_{D}^{2}/√(n_{IR}-1). The sampling standard error of SD_{IR} is not exactly equal to the square root of this expression. In a simple simulation of a normally distributed variance with mean zero, the expected sampling standard error of the square root of the variance is ~0.80 of the square root of the standard error of the variance. Hence the sampling standard error of SD_{IR} is 0.80√[2SD_{D}^{2}/√(n_{IR}-1)]. Since n_{IR} turns out to be very much greater than 1, it follows that the uncertainty in SD_{IR} is inversely proportional to the fourth root of the sample size, whereas the uncertainty in mean effects is inversely proportional to the square root.

Now, the smallest important value of a standard deviation is half that of a difference or change in a mean (Smith and Hopkins, 2011; further justification provided below). Evidence that this rule applies to SD_{IR} is provided by considering how the proportions of positive, trivial, and negative responders change as SD_{IR} increases for a given mean effect of the treatment (Table 1).
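The proportions of responders mentioned above can be approximated with a short calculation. This is a minimal sketch, not the author's spreadsheet: it assumes that individual responses are normally distributed around the mean effect with standard deviation SD_{IR}, and that a response is positive or negative when it exceeds the smallest important change in either direction. All quantities are expressed in units of the smallest important mean change, and the function names are hypothetical.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative probability of a normal distribution (requires sd > 0)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def responder_proportions(mean_effect, sd_ir, smallest_important=1.0):
    """Return proportions of (positive, trivial, negative) responders.

    Assumes individual responses ~ Normal(mean_effect, sd_ir^2).
    """
    if sd_ir == 0:
        # No individual responses: everyone shows the mean effect.
        positive = 1.0 if mean_effect > smallest_important else 0.0
        negative = 1.0 if mean_effect < -smallest_important else 0.0
    else:
        positive = 1.0 - normal_cdf(smallest_important, mean_effect, sd_ir)
        negative = normal_cdf(-smallest_important, mean_effect, sd_ir)
    return positive, 1.0 - positive - negative, negative


# Zero and trivial mean effects, with increasing SD_IR.
for mean_effect in (0.0, 0.5):
    for sd_ir in (0.0, 0.5, 1.0):
        pos, triv, neg = responder_proportions(mean_effect, sd_ir)
        print(f"mean={mean_effect:.1f} SD_IR={sd_ir:.1f}: "
              f"positive={pos:.1%} trivial={triv:.1%} negative={neg:.1%}")
```

Under these assumptions, a mean effect of zero with an SD_{IR} of 1.0 gives about 16% positive responders, whereas an SD_{IR} of 0.5 gives only about 2%.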
These proportions were derived with a spreadsheet that can also be used to investigate how they are impacted by uncertainty in the SD_{IR}. On the reasonable assumption that a difference of 10% in the proportion of responders is substantial, an SD_{IR} of 0.5× the smallest important mean change produces a substantial difference in proportions of responders when the mean change is trivial (0.5× the smallest important change), and an SD_{IR} of 1.0× produces substantial differences in proportions when the mean change is zero or trivial. Larger values of SD_{IR} are needed for substantial changes in proportions when changes in the mean are substantial. Thus 0.5× the smallest important mean change is an appropriate smallest important value for SD_{IR} in the worst-case scenario of trivial changes in the mean.

The standard error for SD_{IR} therefore needs to be 0.5 of the standard error for the change in the mean, when the sample size for the change in the mean (n_{D}) gives adequate precision for zero change in the mean. The standard error for the change in the mean in each group is SD_{D}/√n_{D}, and the standard error for the difference in the changes is √2 × SD_{D}/√n_{D}. So 0.80√[2SD_{D}^{2}/√(n_{IR}-1)] = 0.5 × √2 × SD_{D}/√n_{D}, from which it follows that n_{IR} = 1 + (0.80/0.5)^{4}n_{D}^{2} ≈ 6.5n_{D}^{2}. I have used simulations published in this issue of Sportscience to check that this formula is valid (Hopkins, 2018).

Hopkins WG (2006). Estimating sample size for magnitude-based inferences. Sportscience 10, 63-70

Reviewer's commentary
This is a very useful contribution to the body of knowledge on treatment heterogeneity. Hopkins has demonstrated that the required sample size for adequate precision of estimation of the SD for individual responses (in the worst-case scenario) is infeasibly large, and no such trial could ever be conducted.
For example, consider a conventional parallel-group, before-and-after RCT planned with 90% power at 2-tailed P=0.05 to detect a difference of 3 mmHg in systolic blood pressure with an SD of 10 mmHg, with a correlation between baseline and follow-up measures over the time course of the experiment of r=0.7. Such a study, based on an ANCOVA analysis model to adjust for chance baseline imbalance, would require 120 participants in each arm. Detecting individual response variance with adequate precision would require up to 93,600 participants per group (6.5 × 120^{2})! As Hopkins mentions, much smaller and more realistic sample sizes would be needed if the net mean effect (intervention minus control) and the SD for individual responses were substantial. However, he argues persuasively that it is more sensible to design trials with adequate precision to evaluate the effect of putative modifiers of true individual response variance. In this instance the "rule of 4" applies: for any such effect modifier we need 4× the sample size required for the overall net mean effect (480 per arm in the above example). With ever-increasing hype surrounding personalized or precision medicine, we need larger trials and appropriate analysis methods to make robust inferences.
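The sample sizes in the reviewer's example can be reproduced from the formulas in the item. The sketch below is illustrative only: the function names are hypothetical, the factor 0.80 and the ratio 0.5 are taken from the text as given, and the calculation applies only to the worst-case scenario of zero mean change and zero individual responses.

```python
def n_for_individual_responses(n_mean, se_factor=0.80, sd_to_mean_ratio=0.5):
    """Per-group sample size for SD_IR: 1 + (0.80/0.5)^4 * n^2."""
    return 1 + (se_factor / sd_to_mean_ratio) ** 4 * n_mean ** 2


def n_for_modifiers(n_mean):
    """Per-group sample size for effect modifiers (the 'rule of 4')."""
    return 4 * n_mean


n_mean = 120  # per-arm sample size for the mean effect in the example

print("Individual responses (unrounded factor):",
      round(n_for_individual_responses(n_mean)))
print("Individual responses (factor rounded to 6.5):",
      round(6.5 * n_mean ** 2))
print("Effect modifiers:", n_for_modifiers(n_mean))
```

With the factor rounded to 6.5 the result is the 93,600 per group quoted above; the unrounded factor (1.6^{4} = 6.5536) gives a slightly larger value of roughly 94,400.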