
Elements of Sample Size

February 06, 2023
  • 00:02[Maria Ciarleglio] My name is Maria Ciarleglio
  • 00:04and I'm a faculty member in the Department of Biostatistics
  • 00:08at the Yale School of Public Health.
  • 00:11In this video series,
  • 00:12I will introduce the clinical research process
  • 00:15to prepare you to collaborate with a statistician.
  • 00:20In this fourth video,
  • 00:22we'll discuss elements of sample size determination,
  • 00:25which is an important part of study design.
  • 00:30In statistics, we apply methods that allow us to use data
  • 00:35from a sample to answer a variety of questions.
  • 00:39Sample data are used to estimate population parameters
  • 00:43such as population means or population proportions,
  • 00:47to develop models relating one or more explanatory variables
  • 00:51to a response variable, and to test hypotheses.
  • 00:55In all we do, we answer these questions
  • 00:57using a representative sample
  • 01:00from the population of interest.
  • 01:02This leads to the natural question:
  • 01:04how large should the sample be?
  • 01:07Sample size methods that we'll discuss today
  • 01:09are presented with the idea of a parallel-design
  • 01:13two-arm randomized clinical trial in mind,
  • 01:16but they can also be applied to other designs,
  • 01:19such as observational studies.
  • 01:21Our goal is to determine the sample size needed
  • 01:24to be able to detect a hypothesized difference
  • 01:28of clinical interest between the two study arms.
  • 01:31If a difference truly exists
  • 01:33we want to be able to detect that difference
  • 01:36with high probability,
  • 01:38and that probability is a term
  • 01:40we'll introduce shortly, known as statistical power.
  • 01:44The sample size calculation depends
  • 01:46on the planned hypothesis test,
  • 01:48but the planned test depends
  • 01:50on the study's primary endpoint.
  • 01:53In video three, we reviewed quantitative endpoints,
  • 01:57dichotomous endpoints, and time to event endpoints.
  • 02:00When the primary endpoint is a continuous measure,
  • 02:04such as change in portal pressure within a subject,
  • 02:07we summarize the response using the average
  • 02:11or mean change in portal pressure in each treatment group.
  • 02:15If there's no difference between the treatments on average
  • 02:18then the difference in means in group two versus group one,
  • 02:23mu two minus mu one, equals zero.
  • 02:26This is called our null hypothesis.
  • 02:29Our goal is usually to demonstrate a difference,
  • 02:32so we would like to reject the null hypothesis
  • 02:36and conclude that a difference exists.
  • 02:39The hypothesis of a difference
  • 02:41or effect is called the alternative hypothesis,
  • 02:45and we'll discuss the alternative hypothesis
  • 02:47on the next slide.
  • 02:50When the primary endpoint is a dichotomous measure,
  • 02:53such as treatment response, yes or no,
  • 02:56we summarize the response using the proportion
  • 02:59of patients who respond in each treatment group.
  • 03:02Again, if there's no difference between the treatments,
  • 03:05the difference in proportions,
  • 03:06P two minus P one, equals zero under the null hypothesis.
  • 03:12When the primary endpoint is time to event,
  • 03:15such as time to death or time to relapse,
  • 03:17then we often represent the effect
  • 03:20in terms of the hazard rate, lambda.
  • 03:23If there's no difference between the treatments
  • 03:25then under the null hypothesis,
  • 03:27the difference in the hazard rates equals zero,
  • 03:31or equivalently, the hazard ratio equals one.
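
In symbols, the null hypotheses just described for the three endpoint types can be written as follows (a compact summary of the statements above, not a slide from the video):

```latex
% Null hypotheses under "no treatment difference"
H_0:\; \mu_2 - \mu_1 = 0          % continuous endpoint (difference in means)
H_0:\; p_2 - p_1 = 0              % dichotomous endpoint (difference in proportions)
H_0:\; \lambda_2 - \lambda_1 = 0  % time to event endpoint (difference in hazards),
                                  % equivalently HR = \lambda_2 / \lambda_1 = 1
```
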
  • 03:36Ideally, if there's a treatment effect,
  • 03:38we reject the null hypothesis
  • 03:40and conclude the alternative hypothesis,
  • 03:43which states there is a difference in the two populations.
  • 03:46For example, suppose our goal is to determine
  • 03:50if there's a difference in the proportion
  • 03:51of responders in those on Sorafenib compared to placebo.
  • 03:56The alternative hypothesis tested is a two-sided alternative
  • 04:00that the difference in the proportion of responders
  • 04:02is not equal to zero.
  • 04:04When performing two-sided tests,
  • 04:07if our test statistic falls
  • 04:09into either of the blue rejection regions, shown here,
  • 04:13we would reject the null hypothesis of no difference,
  • 04:16and conclude that there's a significant difference
  • 04:18between treatments.
  • 04:21Suppose, instead, we were only interested
  • 04:24in a significant conclusion
  • 04:26if we showed that the proportion of responders
  • 04:29is higher in the treatment group
  • 04:31compared to the placebo group.
  • 04:34This would give a difference in proportions
  • 04:36greater than zero.
  • 04:37In this case, we would only be interested
  • 04:40in effects in the upper tail, shown in red.
  • 04:43This one-directional test is called a one-sided test,
  • 04:47or more specifically, an upper tail test.
  • 04:51Similarly, we may be interested in an effect
  • 04:55in the negative direction,
  • 04:56in which case we would only look
  • 04:58for a significant conclusion in the lower tail.
  • 05:02This one-sided test is a lower tail test.
  • 05:07The direction of your test
  • 05:08affects your sample size calculation.
  • 05:11We will talk about this alpha symbol shortly,
  • 05:14but it's called the significance level,
  • 05:16or the type one error of your test.
  • 05:19It's the probability
  • 05:20of incorrectly rejecting the null hypothesis.
  • 05:25In one-sided tests,
  • 05:26since we're only looking in one direction
  • 05:29for evidence against the null hypothesis,
  • 05:31all of our type one error is in a single tail.
  • 05:35However, with two-sided tests, because it's possible
  • 05:38for us to reject the null hypothesis,
  • 05:41if there is extreme evidence in either tail
  • 05:44we split our type one error between the two tails.
  • 05:48When looking at each tail,
  • 05:50we actually require stronger evidence
  • 05:53against the null to reject in the case of a two-sided test.
  • 05:58Since it's more difficult to reject the null hypothesis
  • 06:01we will need a larger sample size
  • 06:03when performing a two-sided test compared
  • 06:06to a one-sided test.
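
As a quick numerical illustration of this point, here is a minimal sketch using only Python's standard library (the 0.05 level matches the discussion; the variable names are mine):

```python
from statistics import NormalDist

alpha = 0.05
z = NormalDist()  # standard normal distribution

# One-sided test: all of the type one error sits in a single tail.
z_one_sided = z.inv_cdf(1 - alpha)      # about 1.645

# Two-sided test: the type one error is split between the two tails.
z_two_sided = z.inv_cdf(1 - alpha / 2)  # about 1.960

print(f"one-sided critical value: {z_one_sided:.3f}")
print(f"two-sided critical value: {z_two_sided:.3f}")
```

The larger two-sided critical value is the "stronger evidence" just described, and because this value enters the sample size formula discussed later, it translates directly into a larger required sample size.
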
  • 06:09We recommend performing two-sided tests.
  • 06:13Although you might expect a new treatment
  • 06:16to demonstrate superiority over the control treatment,
  • 06:20it's always good to have the option to formally
  • 06:23reject the null hypothesis
  • 06:24if an effect is seen in the opposite direction.
  • 06:29There are several key factors
  • 06:31that affect the required sample size:
  • 06:33the hypothesized treatment difference, delta;
  • 06:36the variability or noise in the endpoint measurement, sigma;
  • 06:40the level of statistical significance, alpha;
  • 06:43and the level of statistical power, one minus beta.
  • 06:48We'll discuss each of these components,
  • 06:50starting with the expected clinical difference
  • 06:53between the two treatments being tested.
  • 06:56In order to estimate sample size, you must first specify
  • 07:00the magnitude of the difference you wish to detect.
  • 07:04We denote this difference as delta.
  • 07:07Sample size is calculated
  • 07:08under a specific alternative hypothesis,
  • 07:11that the difference in your parameters,
  • 07:14the difference in means here, is equal to delta.
  • 07:18The blue curve shows us the distribution
  • 07:21of the difference in means under the null hypothesis.
  • 07:24Under the null, the distribution is centered at zero,
  • 07:28assuming no difference on average between the treatments.
  • 07:33Under the alternative,
  • 07:34for the purpose of sample size calculations,
  • 07:37the distribution of the difference in means
  • 07:39is centered at delta and is represented by the red curve.
  • 07:43The more different the two distributions are assumed to be,
  • 07:47the larger delta is,
  • 07:49and the less overlap we see between the two distributions.
  • 07:53Smaller differences are more difficult to detect
  • 07:57because the distributions are closer together,
  • 07:59and, as a result, we require a larger sample size
  • 08:03to be able to detect the small difference
  • 08:05and distinguish that difference from random variation.
  • 08:09Larger hypothesized differences
  • 08:11require smaller sample sizes.
  • 08:14How do you choose a value for delta?
  • 08:16Sometimes there's prior knowledge
  • 08:18that allows an investigator to anticipate
  • 08:21the treatment benefit that's likely to be observed,
  • 08:24and the role of the study is to confirm that expectation.
  • 08:27Other times, delta is taken to equal the smallest
  • 08:30or minimum clinically relevant difference
  • 08:33that would warrant adopting the new treatment.
  • 08:37Investigators are often optimistic
  • 08:40about the effect of a new treatment,
  • 08:42and that's understandable,
  • 08:43but I recommend you not be overly optimistic.
  • 08:47If the treatment effect is not as large as expected,
  • 08:51you could end up with a null or negative trial,
  • 08:54which is a trial that does not show
  • 08:56a significant difference.
  • 08:58There may actually be a true and worthwhile
  • 09:02treatment benefit that's been missed
  • 09:04because the difference was mis-specified
  • 09:06or hypothesized to be too large.
  • 09:09This is why a lot of thought
  • 09:11needs to go into the study design
  • 09:14and what is considered meaningful.
  • 09:18The next element involved in sample size determination
  • 09:21is the standard deviation of the primary endpoint.
  • 09:24The standard deviation is denoted by sigma,
  • 09:27and needs to be specified
  • 09:29when we're dealing with a continuous primary endpoint.
  • 09:33In this figure,
  • 09:34there are actually four normal distributions plotted.
  • 09:38Let's begin with the solid blue curve
  • 09:41and the solid red curve.
  • 09:43These two curves have the same standard deviation.
  • 09:46Their standard deviation is larger
  • 09:49than the standard deviation
  • 09:50of the dashed blue curve and the dashed red curve.
  • 09:54As sigma decreases,
  • 09:55there's less overlap between the two distributions.
  • 09:59More noise or higher variability makes it more
  • 10:03difficult to detect differences
  • 10:05and requires a larger sample size.
  • 10:09One thing to note is that the treatment difference,
  • 10:12delta, is sometimes standardized
  • 10:14and presented as an effect size
  • 10:17denoted here by capital delta.
  • 10:19This is simply little delta divided by sigma.
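
In symbols (restating the definition just given):

```latex
\Delta = \frac{\delta}{\sigma}
\qquad \text{(standardized effect size: the treatment difference in standard-deviation units)}
```
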
  • 10:24There are two errors that we can make
  • 10:27when we perform a hypothesis test,
  • 10:29and both of them influence sample size.
  • 10:31We fix these errors
  • 10:33at levels that we believe to be acceptable,
  • 10:35and they're usually set to relatively small values.
  • 10:39The first error we'll discuss is type one error
  • 10:42or the alpha level of the test.
  • 10:45The blue curve is, again,
  • 10:46the distribution of the difference in means
  • 10:49under the null hypothesis.
  • 10:51The red curve is the distribution
  • 10:54under the specific alternative hypothesis
  • 10:57that assumes the treatment effect is equal to delta.
  • 11:01Hypothesis testing is performed
  • 11:03assuming the null hypothesis is true.
  • 11:06That is, assuming the blue curve is true.
  • 11:09The green shaded areas in the tails of the blue curve
  • 11:13represent extreme values that aren't likely to be observed
  • 11:16if the difference in means is equal to zero,
  • 11:20that is, if the null hypothesis is true.
  • 11:22If we observe a result in the green shaded area
  • 11:26then we'll reject the null hypothesis
  • 11:29and conclude the alternative hypothesis.
  • 11:32This is equivalent to observing a P value
  • 11:34of the test less than or equal to alpha.
  • 11:38However, if the null hypothesis is true,
  • 11:41then we're committing an error
  • 11:43by concluding there is an effect,
  • 11:45there is a difference, when in fact there isn't one.
  • 11:49This is called a type one error.
  • 11:51The smaller you make the green shaded area,
  • 11:54the less likely you will incorrectly reject
  • 11:57the null hypothesis,
  • 11:59because you're going to require
  • 12:00greater and greater evidence to do so.
  • 12:03We typically set alpha equal to 0.05
  • 12:06because it's felt that a 5% chance
  • 12:09of falsely rejecting the null hypothesis is acceptable.
  • 12:14Choosing a smaller alpha will increase your protection
  • 12:18against committing a type one error,
  • 12:20but there's a trade off
  • 12:21in that it will be more difficult for you to conclude
  • 12:24there's a difference, even when there is one.
  • 12:27Decreasing alpha will increase the required sample size.
  • 12:33The second error is called type two error,
  • 12:36and it's denoted beta,
  • 12:39the gray shaded region in this figure.
  • 12:43We do not reject the null hypothesis
  • 12:45if the difference we observe falls in the gray region.
  • 12:49We only reject the null if the difference observed
  • 12:52falls in either of the green regions,
  • 12:55the rejection region of the test.
  • 12:57However, because the two distributions overlap,
  • 13:00there is this gray shaded region
  • 13:02where the alternative hypothesis is true,
  • 13:05the red curve is true, but we fail to reject the null
  • 13:08because we don't observe an effect that's extreme enough.
  • 13:12When this occurs, we are committing a type two error.
  • 13:16Of course, we want the type two error to be low,
  • 13:19but rather than set beta, we usually set one minus beta,
  • 13:23which is called the statistical power of the test.
  • 13:27This is represented by the purple shaded area.
  • 13:31Power is the probability of rejecting the null hypothesis
  • 13:35when we should.
  • 13:36That is, rejecting a false null hypothesis,
  • 13:40and we want this probability to be high.
  • 13:43We typically set power to be at least 80%.
  • 13:47Larger power will require a larger sample size
  • 13:51to increase our chance of detecting a true difference.
  • 13:56If you work with all of these ideas in their equation form
  • 14:00you can derive a fundamental sample size equation
  • 14:04that relates all four of these parameters
  • 14:06to the sample size required in each treatment group.
  • 14:10This equation shown here assumes a continuous
  • 14:13primary outcome variable,
  • 14:14but the relationships are the same for any outcome,
  • 14:17including binary and time to event.
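
The equation itself appears on a slide rather than in the transcript; the standard per-group formula for comparing two means, consistent with the quantities described here, is:

```latex
% Per-group sample size, two-sided test on the difference of two means
n = \frac{2\,\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2 \sigma^2}{\delta^2}
  = \frac{2\,\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2}{\Delta^2},
\qquad \Delta = \frac{\delta}{\sigma}
```
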
  • 14:20We see sigma and delta.
  • 14:23Delta is the treatment effect
  • 14:25or the difference in group means.
  • 14:27As I mentioned before,
  • 14:29you can divide the difference in means, delta,
  • 14:32by the common standard deviation, sigma,
  • 14:35to write the equation as a function
  • 14:36of the standardized effect size.
  • 14:39As sigma increases,
  • 14:41it's clear that the sample size will increase.
  • 14:43This is because the data are noisier
  • 14:46and more heterogeneous; it's more difficult
  • 14:49to detect a signal when this is the case,
  • 14:52so we need a larger sample size.
  • 14:54Delta is in the denominator, so as delta decreases,
  • 14:58the sample size will increase.
  • 15:00When delta is small, there will be more overlap
  • 15:04between the two distributions
  • 15:05and it will be more difficult to detect a difference,
  • 15:09so we need a larger sample size
  • 15:11to detect smaller differences.
  • 15:14All of these relationships make sense
  • 15:15if you talk them through,
  • 15:17and they're supported by the equations
  • 15:20used to perform the sample size calculations.
  • 15:23In terms of alpha and beta,
  • 15:25our type one and type two errors,
  • 15:28they are here in the numerator
  • 15:30but they're represented by their corresponding Z values.
  • 15:34Smaller alpha and beta errors
  • 15:36correspond to larger Z values and larger sample sizes.
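
Putting the pieces together, here is a minimal runnable sketch of that calculation (standard library only; the function name and defaults are mine, not from the video):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta: float, sigma: float, alpha: float = 0.05,
                power: float = 0.80, two_sided: bool = True) -> int:
    """Per-group sample size for comparing two means (normal approximation)."""
    z = NormalDist()
    # Critical value: alpha is split across both tails for a two-sided test.
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    # Power = 1 - beta, so this is the z value corresponding to 1 - beta.
    z_beta = z.inv_cdf(power)
    n = 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2
    return ceil(n)  # round up to the next whole subject
```
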
  • 15:42We'll wrap up this video by going
  • 15:44through the three common endpoint types
  • 15:47and discussing the elements of sample size determination
  • 15:50that you need to define for the sample size calculation
  • 15:54in each case.
  • 15:55You'll need to specify the type one error level, alpha,
  • 15:59and the direction of the alternative hypothesis.
  • 16:02That is, are you performing a one-sided
  • 16:04or a two-sided hypothesis test?
  • 16:07You'll also need to specify the level
  • 16:09of statistical power of the test.
  • 16:12When your primary endpoint is a continuous variable,
  • 16:16such as change in portal pressure,
  • 16:18you'll need to specify delta,
  • 16:20the magnitude of the hypothesized difference
  • 16:23in mean portal pressure change in the two treatment groups,
  • 16:27and sigma, the standard deviation
  • 16:29of the change in portal pressure.
  • 16:31We often assume that the variability
  • 16:34of the response is the same in both arms,
  • 16:37but the sample size calculations
  • 16:38can accommodate unequal standard deviations
  • 16:41in each population.
  • 16:44Again, we can specify the difference
  • 16:45as a standardized effect size.
  • 16:48Cohen suggested values of the effect size that correspond
  • 16:52to small, moderate, and large effects.
  • 16:55A small effect is estimated at 0.2,
  • 16:59a moderate effect is 0.5, and a large effect is 0.8.
  • 17:05Again, as delta decreases,
  • 17:07the sample size necessary to detect that effect increases.
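
Plugging Cohen's benchmark values into the n_per_group sketch above (with sigma = 1, so delta equals the standardized effect size) makes that relationship concrete:

```python
for effect_size in (0.8, 0.5, 0.2):  # large, moderate, small
    print(effect_size, n_per_group(delta=effect_size, sigma=1.0))
# 0.8 -> about 25 per group
# 0.5 -> about 63 per group
# 0.2 -> about 393 per group
```
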
  • 17:13When your primary endpoint is a binary variable,
  • 17:17such as development of surgical site infection,
  • 17:19we summarize the response using the proportion of responders
  • 17:23in each treatment group.
  • 17:25The anticipated effects between groups can be expressed
  • 17:28as the difference in the two proportions,
  • 17:31P two, minus P one,
  • 17:33so you would need to specify
  • 17:35the hypothesized proportion of responders
  • 17:38in each group for the sample size calculation.
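
Here is a minimal sketch of the usual normal-approximation formula for two proportions (the function name and the 30% versus 50% response rates below are hypothetical placeholders, not values from the video):

```python
from math import ceil
from statistics import NormalDist

def n_per_group_props(p1: float, p2: float, alpha: float = 0.05,
                      power: float = 0.80) -> int:
    """Per-group sample size for a two-sided test of two proportions,
    using the pooled normal approximation."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    p_bar = (p1 + p2) / 2  # pooled proportion under equal allocation
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# Hypothetical rates: 30% responders on control, 50% on treatment.
print(n_per_group_props(0.30, 0.50))  # about 93 per group
```
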
  • 17:42Finally, when your primary endpoint is a survival,
  • 17:46or time to event endpoint, such as time to death,
  • 17:49or time to progression,
  • 17:50the anticipated effect size between groups
  • 17:53is usually in the form of a difference in hazard rates,
  • 17:57lambda two minus lambda one, or a hazard ratio.
  • 18:02You would need to specify the hypothesized hazards
  • 18:05in each group, or the hypothesized hazard ratio.
  • 18:09For example, if the intervention reduces the mortality rate
  • 18:13by 20%, the hazard ratio would equal 0.8.
  • 18:19You may have prior data
  • 18:20on a quantity called median survival time.
  • 18:23This is often reported in the literature.
  • 18:27The median survival time is the time point
  • 18:29when we expect the survival probability to equal 50%.
  • 18:35In the sample data,
  • 18:37the estimated survival probability
  • 18:39or probability of surviving
  • 18:41beyond a certain number of weeks
  • 18:44is plotted on the vertical axis.
  • 18:47The survival curve hits 50% at 23 weeks,
  • 18:51so the median survival time
  • 18:53in this group is estimated to be 23 weeks.
  • 18:57Under the model that we typically use,
  • 19:00the hazard ratio is equal
  • 19:01to the ratio of the median survival times in the two groups.
  • 19:06For example, if the median survival time
  • 19:08in the drug group is twice that seen in the placebo group,
  • 19:12the hypothesized hazard ratio would equal one half.
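
The model typically used here is the exponential survival model, under which the hazard is constant and determined by the median. A short derivation of the claim, with the video's example (note that the medians enter in inverse order):

```latex
% Exponential model: S(t) = e^{-\lambda t}.
% The median m solves e^{-\lambda m} = 1/2, so \lambda = \ln 2 / m. Hence
HR = \frac{\lambda_2}{\lambda_1}
   = \frac{\ln 2 / m_2}{\ln 2 / m_1}
   = \frac{m_1}{m_2}
% Example: m_2 (drug) = 2 m_1 (placebo)  =>  HR = 1/2
```
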
  • 19:18Other important quantities to specify
  • 19:21in a survival sample size calculation
  • 19:23are the duration of the accrual period
  • 19:26and the duration of follow-up.
  • 19:29These will affect the number of events,
  • 19:32since longer studies have a greater opportunity
  • 19:35to observe study events.
  • 19:39Lastly, I want to discuss an important issue
  • 19:43that affects the required sample size,
  • 19:45and that is the anticipated proportion of subjects
  • 19:49who are lost to follow up.
  • 19:50Since these subjects are lost,
  • 19:52we'll never observe their endpoint,
  • 19:55so we need to compensate for their loss.
  • 19:57If the anticipated loss or withdrawal proportion
  • 20:01is W, where W is a proportion between zero and one,
  • 20:06then the required number of patients per group
  • 20:09should be inflated to n adjusted,
  • 20:12which equals the originally planned per group sample size,
  • 20:16n, divided by one minus W.
  • 20:19The estimated size of W can often be obtained
  • 20:23from prior studies.
  • 20:25If there's no prior data,
  • 20:26then you may want to set W equal to 0.1 or 10%.
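
A quick worked example (the per-group n of 63 reuses the moderate-effect calculation sketched earlier; any planned n works the same way):

```python
from math import ceil

n_planned = 63     # per-group n from the power calculation
W = 0.10           # anticipated dropout proportion
n_adjusted = ceil(n_planned / (1 - W))
print(n_adjusted)  # 70 per group
```
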
  • 20:32One thing to note is that we're assuming
  • 20:34that the loss to follow up is occurring at random
  • 20:38and it's not related to the health status of the subject.
  • 20:42If it's true that, for example, sicker patients
  • 20:44are dropping out of the study,
  • 20:46then this may bias the results,
  • 20:48especially if more of the sicker patients
  • 20:50are dropping out of one group than the other.
  • 20:54Inflating the sample size for dropouts
  • 20:56will not fix a biased study,
  • 20:58so it's important to try to minimize dropouts
  • 21:01as much as possible.
  • 21:04The sample size calculations
  • 21:06are an important part of the study design process.
  • 21:10The calculations can't be performed
  • 21:12by the statistician alone.
  • 21:14Input from the investigators and the study team is important
  • 21:18when it comes to setting these sample size parameters.
  • 21:21So it's my hope that you've come away with an understanding
  • 21:24of the different factors that you need to consider
  • 21:27and think about during the study planning process.