Elements of Sample Size
February 06, 2023
In this fourth video, we discuss sample size considerations for quantitative, dichotomous, and time-to-event endpoints.
- 00:02[Maria Ciarleglio] My name is Maria Ciarleglio
- 00:04and I'm a faculty member in the Department of Biostatistics
- 00:08at the Yale School of Public Health.
- 00:11In this video series,
- 00:12I will introduce the clinical research process
- 00:15to prepare you to collaborate with a statistician.
- 00:20In this fourth video,
- 00:22we'll discuss elements of sample size determination,
- 00:25which is an important part of study design.
- 00:30In statistics, we apply methods that allow us to use data
- 00:35from a sample to answer a variety of questions.
- 00:39Sample data are used to estimate population parameters,
- 00:43such as population means or population proportions,
- 00:47to develop models relating one or more explanatory variables
- 00:51to a response variable, and to test hypotheses.
- 00:55In all we do, we answer these questions
- 00:57using a representative sample
- 01:00from the population of interest.
- 01:02This leads to the natural question:
- 01:04how large should the sample be?
- 01:07Sample size methods that we'll discuss today
- 01:09are presented with the idea of a parallel-design
- 01:13two-arm randomized clinical trial in mind,
- 01:16but they can also be applied to other designs,
- 01:19such as observational studies.
- 01:21Our goal is to determine the sample size needed
- 01:24to be able to detect a hypothesized difference
- 01:28of clinical interest between the two study arms.
- 01:31If a difference truly exists
- 01:33we want to be able to detect that difference
- 01:36with high probability,
- 01:38and that probability is a term
- 01:40we'll introduce shortly known as statistical power.
- 01:44The sample size calculation depends
- 01:46on the planned hypothesis test,
- 01:48but the planned test depends
- 01:50on the study's primary endpoint.
- 01:53In video three, we reviewed quantitative endpoints,
- 01:57dichotomous endpoints, and time to event endpoints.
- 02:00When the primary endpoint is a continuous measure,
- 02:04such as change in portal pressure within a subject,
- 02:07we summarize the response using the average
- 02:11or mean change in portal pressure in each treatment group.
- 02:15If there's no difference between the treatments on average
- 02:18then the difference in means in group two versus group one,
- 02:23mu two, minus mu one, equals zero.
- 02:26This is called our null hypothesis.
- 02:29Our goal is usually to demonstrate a difference,
- 02:32so we would like to reject the null hypothesis
- 02:36and conclude that a difference exists.
- 02:39The hypothesis of a difference
- 02:41or effect is called the alternative hypothesis,
- 02:45and we'll discuss the alternative hypothesis
- 02:47on the next slide.
- 02:50When the primary endpoint is a dichotomous measure,
- 02:53such as treatment response, yes or no,
- 02:56we summarize the response using the proportion
- 02:59of patients who respond in each treatment group.
- 03:02Again, if there's no difference between the treatments
- 03:05the difference in proportions,
- 03:06P two, minus P one, equals zero under the null hypothesis.
- 03:12When the primary endpoint is time to event,
- 03:15such as time to death or time to relapse,
- 03:17then we often represent the effect
- 03:20in terms of the hazard rate, lambda.
- 03:23If there's no difference between the treatments
- 03:25then under the null hypothesis,
- 03:27the difference in the hazard rates equals zero,
- 03:31or equivalently, the hazard ratio equals one.
- 03:36Ideally, if there's a treatment effect,
- 03:38we reject the null hypothesis
- 03:40and conclude the alternative hypothesis,
- 03:43which states that there is a difference between the two populations.
- 03:46For example, suppose our goal is to determine
- 03:50if there's a difference in the proportion
- 03:51of responders in those on Sorafenib compared to placebo.
- 03:56The alternative hypothesis tested is a two-sided alternative
- 04:00that the difference in the proportion of responders
- 04:02is not equal to zero.
- 04:04When performing two-sided tests,
- 04:07if our test statistic falls
- 04:09into either of the blue rejection regions, shown here,
- 04:13we would reject the null hypothesis of no difference,
- 04:16and conclude that there's a significant difference
- 04:18between treatments.
- 04:21Suppose, instead, we were only interested
- 04:24in a significant conclusion
- 04:26if we showed that the proportion of responders
- 04:29is higher in the treatment group
- 04:31than in the placebo group;
- 04:34this would give a difference in proportions
- 04:36greater than zero.
- 04:37In this case, we would only be interested
- 04:40in effects in the upper tail, shown in red.
- 04:43This one directional test is called a one-sided test
- 04:47or more specifically an upper tail test.
- 04:51Similarly, we may be interested in an effect
- 04:55in the negative direction,
- 04:56in which case we would only look
- 04:58for a significant conclusion in the lower tail.
- 05:02This one-sided test is a lower tail test.
- 05:07The direction of your test
- 05:08affects your sample size calculation.
- 05:11We will talk about this alpha symbol shortly,
- 05:14but it's called the significance level,
- 05:16or the type one error of your test.
- 05:19It's the probability
- 05:20of incorrectly rejecting the null hypothesis.
- 05:25In one-sided tests,
- 05:26since we're only looking in one direction
- 05:29for evidence against the null hypothesis,
- 05:31all of our type one error is in a single tail.
- 05:35However, with two-sided tests, because it's possible
- 05:38for us to reject the null hypothesis,
- 05:41if there is extreme evidence in either tail
- 05:44we split our type one error between the two tails.
- 05:48When looking at each tail
- 05:50we actually require stronger evidence
- 05:53against the null to reject in the case of a two-sided test.
- 05:58Since it's more difficult to reject the null hypothesis
- 06:01we will need a larger sample size
- 06:03when performing a two-sided test compared
- 06:06to a one-sided test.
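[Editorial aside: the one- versus two-sided critical values discussed above can be computed with Python's standard library. A minimal sketch; the variable names are illustrative, not from the video.]

```python
from statistics import NormalDist

z = NormalDist().inv_cdf  # standard normal quantile function

alpha = 0.05
z_one_sided = z(1 - alpha)      # all of alpha in one tail: ~1.645
z_two_sided = z(1 - alpha / 2)  # alpha split across both tails: ~1.960
```

Because the two-sided critical value is larger, stronger evidence is needed to reject, which is why the two-sided test requires the larger sample size.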
- 06:09We recommend performing two-sided tests.
- 06:13Although you might expect a new treatment
- 06:16to demonstrate superiority over the control treatment,
- 06:20it's always good to have the option to formally
- 06:23reject the null hypothesis
- 06:24if an effect is seen in the opposite direction.
- 06:29There are several key factors
- 06:31that affect the required sample size.
- 06:33The hypothesized treatment difference, delta,
- 06:36the variability or noise in the endpoint measurement, sigma,
- 06:40the level of statistical significance, alpha,
- 06:43and the level of statistical power, one minus beta.
- 06:48We'll discuss each of these components,
- 06:50starting with the expected clinical difference
- 06:53between the two treatments being tested.
- 06:56In order to estimate sample size, you must first specify
- 07:00the magnitude of the difference you wish to detect.
- 07:04We denote this difference as delta.
- 07:07Sample size is calculated
- 07:08under a specific alternative hypothesis,
- 07:11that the difference in your parameters,
- 07:14the difference in means here, is equal to delta.
- 07:18The blue curve shows us the distribution
- 07:21of the difference in means under the null hypothesis.
- 07:24Under the null, the distribution is centered at zero,
- 07:28assuming no difference on average between the treatments.
- 07:33Under the alternative,
- 07:34for the purpose of sample size calculations,
- 07:37the distribution of the difference in means
- 07:39is centered at delta and is represented by the red curve.
- 07:43The more different the two distributions are assumed to be,
- 07:47the larger delta,
- 07:49and the less overlap we see between the two distributions.
- 07:53Smaller differences are more difficult to detect
- 07:57because the distributions are closer together,
- 07:59and, as a result, we require a larger sample size
- 08:03to be able to detect the small difference
- 08:05and distinguish that difference from random variation.
- 08:09Larger hypothesized differences
- 08:11require smaller sample sizes.
- 08:14How do you choose a value for delta?
- 08:16Sometimes there's prior knowledge
- 08:18that allows an investigator to anticipate
- 08:21the treatment benefit that's likely to be observed,
- 08:24and the role of the study is to confirm that expectation.
- 08:27Other times, delta's taken to equal the smallest
- 08:30or minimum clinically relevant difference
- 08:33that would warrant adopting the new treatment.
- 08:37Investigators are often optimistic
- 08:40about the effect of a new treatment,
- 08:42and that's understandable,
- 08:43but I recommend you not be overly optimistic.
- 08:47If the treatment effect is not as large as expected,
- 08:51you could end up with a null or negative trial,
- 08:54which is a trial that does not show
- 08:56a significant difference.
- 08:58There may actually be a true and worthwhile
- 09:02treatment benefit that's been missed
- 09:04because the difference was mis-specified
- 09:06or hypothesized to be too large.
- 09:09This is why a lot of thought
- 09:11needs to go into the study design
- 09:14and what is considered meaningful.
- 09:18The next element involved in sample size determination
- 09:21is the standard deviation of the primary endpoint.
- 09:24The standard deviation is denoted by sigma,
- 09:27and needs to be specified
- 09:29when we're dealing with a continuous primary endpoint.
- 09:33In this figure,
- 09:34there are actually four normal distributions plotted.
- 09:38Let's begin with the solid blue curve
- 09:41and the solid red curve.
- 09:43These two curves have the same standard deviation.
- 09:46Their standard deviation is larger
- 09:49than the standard deviation
- 09:50of the dashed blue curve and the dashed red curve.
- 09:54As sigma decreases,
- 09:55there's less overlap between the two distributions.
- 09:59More noise or higher variability makes it more
- 10:03difficult to detect differences
- 10:05and requires a larger sample size.
- 10:09One thing to note is that the treatment difference,
- 10:12delta, is sometimes standardized
- 10:14and presented as an effect size
- 10:17denoted here by capital delta.
- 10:19This is simply little delta divided by sigma.
- 10:24There are two errors that we can make
- 10:27when we perform a hypothesis test,
- 10:29and both of them influence sample size.
- 10:31We fix these errors
- 10:33at levels that we believe to be acceptable,
- 10:35and they're usually set to relatively small values.
- 10:39The first error we'll discuss is type one error
- 10:42or the alpha level of the test.
- 10:45The blue curve is, again,
- 10:46the distribution of the difference in means
- 10:49under the null hypothesis.
- 10:51The red curve is the distribution
- 10:54under the specific alternative hypothesis
- 10:57that assumes the treatment effect is equal to delta.
- 11:01Hypothesis testing is performed
- 11:03assuming the null hypothesis is true.
- 11:06That is, assuming the blue curve is true.
- 11:09The green shaded areas in the tails of the blue curve
- 11:13mark extreme values that aren't likely to be observed
- 11:16if the difference in means is equal to zero,
- 11:20that is if the null hypothesis is true.
- 11:22If we observe a result in the green shaded area
- 11:26then we'll reject the null hypothesis
- 11:29and conclude the alternative hypothesis.
- 11:32This is equivalent to observing a P value
- 11:34of the test less than or equal to alpha.
- 11:38However, if the null hypothesis is true,
- 11:41then we're committing an error
- 11:43by concluding there is an effect,
- 11:45there is a difference, when in fact there isn't one.
- 11:49This is called a type one error.
- 11:51The smaller you make the green shaded area,
- 11:54the less likely you will incorrectly reject
- 11:57the null hypothesis,
- 11:59because you're going to require
- 12:00greater and greater evidence to do so.
- 12:03We typically set alpha equal to 0.05
- 12:06because it's felt that a 5% chance
- 12:09of falsely rejecting the null hypothesis is acceptable.
- 12:14Choosing a smaller alpha will increase your protection
- 12:18against committing a type one error,
- 12:20but there's a trade off
- 12:21in that it will be more difficult for you to conclude
- 12:24there's a difference, even when there is one.
- 12:27Decreasing alpha will increase the required sample size.
- 12:33The second error is called type two error,
- 12:36and it's denoted beta,
- 12:39the gray shaded region in this figure.
- 12:43We do not reject the null hypothesis
- 12:45if the difference we observe falls in the gray region.
- 12:49We only reject the null if the difference observed
- 12:52falls in either of the green regions,
- 12:55the rejection region of the test.
- 12:57However, because the two distributions overlap
- 13:00there is this gray shaded region
- 13:02where the alternative hypothesis is true,
- 13:05the red curve is true, but we fail to reject the null
- 13:08because we don't observe an effect that's extreme enough.
- 13:12When this occurs, we are committing a type two error.
- 13:16Of course, we want the type two error to be low,
- 13:19but rather than set beta, we usually set one minus beta,
- 13:23which is called the statistical power of the test.
- 13:27This is represented by the purple shaded area.
- 13:31Power is the probability of rejecting the null hypothesis
- 13:35when we should.
- 13:36That is, rejecting a false null hypothesis,
- 13:40and we want this probability to be high.
- 13:43We typically set power to be at least 80%.
- 13:47Larger power will require a larger sample size
- 13:51to increase our chance of detecting a true difference.
- 13:56If you work with all of these ideas in their equation form
- 14:00you can derive a fundamental sample size equation
- 14:04that relates all four of these parameters
- 14:06to the sample size required in each treatment group.
- 14:10This equation shown here assumes a continuous
- 14:13primary outcome variable,
- 14:14but the relationships are the same for any outcome,
- 14:17including binary and time to event.
- 14:20We see sigma and delta.
- 14:23Delta is the treatment effect
- 14:25or the difference in group means.
- 14:27As I mentioned before,
- 14:29you can divide the difference in means, delta,
- 14:32by the common standard deviation, sigma,
- 14:35to write the equation as a function
- 14:36of the standardized effect size.
- 14:39As sigma increases,
- 14:41it's clear that the sample size will increase.
- 14:43This is because the data are more noisy,
- 14:46more heterogeneous, and it's more difficult
- 14:49to detect a signal when this is the case
- 14:52and we need a larger sample size.
- 14:54Delta is in the denominator, so as delta decreases,
- 14:58the sample size will increase.
- 15:00When delta is small, there will be more overlap
- 15:04between the two distributions
- 15:05and it will be more difficult to detect a difference,
- 15:09so we need a larger sample size
- 15:11to detect smaller differences.
- 15:14All of these relationships make sense
- 15:15if you talk them through
- 15:17and they're supported by the equations
- 15:20used to perform the sample size calculations.
- 15:23In terms of alpha and beta,
- 15:25our type one and type two errors,
- 15:28they are here in the numerator
- 15:30but they're represented by their corresponding Z values.
- 15:34Smaller alpha and beta errors
- 15:36correspond to larger Z values and larger sample sizes.
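[Editorial aside: the fundamental equation referred to on the slide is the standard two-sample formula, n per group = 2(z₁₋α/₂ + z₁₋β)²σ²/δ² for a two-sided test. A minimal sketch in Python using only the standard library; the function name and example numbers are hypothetical, not from the video.]

```python
import math
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Per-group n for a two-sided, two-sample comparison of means."""
    z = NormalDist().inv_cdf
    # n = 2 * (z_{1-alpha/2} + z_{1-beta})^2 * (sigma / delta)^2
    n = 2 * (z(1 - alpha / 2) + z(power)) ** 2 * (sigma / delta) ** 2
    return math.ceil(n)

# Hypothetical example: detect a 5-unit mean difference with SD 10
# (a standardized effect of 0.5) at alpha = 0.05 and 80% power.
n = n_per_group(delta=5, sigma=10)  # 63 patients per group
```

Halving delta (or doubling sigma) quadruples the required n, which is the relationship described above: smaller differences and noisier data both demand larger samples.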
- 15:42We'll wrap up this video by going
- 15:44through the three common endpoint types
- 15:47and discussing the elements of sample size determination
- 15:50that you need to define for the sample size calculation
- 15:54in each case.
- 15:55You'll need to specify the type one error level, alpha,
- 15:59and the direction of the alternative hypothesis.
- 16:02That is, are you performing a one-sided
- 16:04or a two-sided hypothesis test?
- 16:07You'll also need to specify the level
- 16:09of statistical power of the test.
- 16:12When your primary endpoint is a continuous variable,
- 16:16such as change in portal pressure,
- 16:18you'll need to specify delta,
- 16:20the magnitude of the hypothesized difference
- 16:23in mean portal pressure change in the two treatment groups,
- 16:27and sigma, the standard deviation
- 16:29of the change in portal pressure.
- 16:31We often assume that the variability
- 16:34of the response is the same in both arms,
- 17:37but the sample size calculations
- 17:38can accommodate unequal standard deviations
- 16:41in each population.
- 16:44Again, we can specify the difference
- 16:45as a standardized effect size.
- 16:48Cohen suggested values of the effect size that correspond
- 16:52to small, moderate, and large effects.
- 16:55A small effect is estimated at 0.2,
- 16:59a moderate effect is 0.5, and a large effect is 0.8.
- 17:05Again, as delta decreases,
- 17:07the sample size necessary to detect that effect increases.
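[Editorial aside: plugging Cohen's three standardized effect sizes into the formula above shows how quickly n grows as the effect shrinks. A sketch under the same assumptions as before (two-sided alpha of 0.05, 80% power).]

```python
import math
from statistics import NormalDist

z = NormalDist().inv_cdf
alpha, power = 0.05, 0.80
k = 2 * (z(1 - alpha / 2) + z(power)) ** 2  # ~15.7

# Per-group n for Cohen's small / moderate / large standardized effects
sizes = {effect: math.ceil(k / effect ** 2) for effect in (0.2, 0.5, 0.8)}
# small (0.2) -> 393, moderate (0.5) -> 63, large (0.8) -> 25 per group
```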
- 17:13When your primary endpoint is a binary variable,
- 17:17such as development of surgical site infection,
- 17:19we summarize the response using the proportion of responders
- 17:23in each treatment group.
- 17:25The anticipated effects between groups can be expressed
- 17:28as the difference in the two proportions,
- 17:31P two, minus P one,
- 17:33so you would need to specify
- 17:35the hypothesized proportion of responders
- 17:38in each group for the sample size calculation.
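[Editorial aside: a minimal sketch of one common normal-approximation formula for two proportions. The function name and example proportions are hypothetical; other formulas, such as pooled-variance or continuity-corrected versions, give slightly different answers.]

```python
import math
from statistics import NormalDist

def n_per_group_props(p1, p2, alpha=0.05, power=0.80):
    """Per-group n for a two-sided comparison of two proportions
    (simple unpooled normal-approximation formula)."""
    z = NormalDist().inv_cdf
    za, zb = z(1 - alpha / 2), z(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((za + zb) ** 2 * variance / (p2 - p1) ** 2)

# Hypothetical example: 30% response on placebo vs 50% on treatment
n = n_per_group_props(0.30, 0.50)  # 91 patients per group
```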
- 17:42Finally, when your primary endpoint is a survival
- 17:46or time to event endpoint, such as time to death,
- 17:49or time to progression,
- 17:50the anticipated effect size between groups
- 17:53is usually in the form of a difference in hazard rates,
- 17:57lambda two, minus lambda one, or a hazard ratio.
- 18:02You would need to specify the hypothesized hazards
- 18:05in each group, or the hypothesized hazard ratio.
- 18:09For example, if the intervention reduces the mortality rate
- 18:13by 20%, the hazard ratio would equal 0.8.
- 18:19You may have prior data
- 18:20on a quantity called median survival time.
- 18:23This is often reported in the literature.
- 18:27The median survival time is the time point
- 18:29when we expect the survival probability to equal 50%.
- 18:35In the sample data,
- 18:37the estimated survival probability
- 18:39or probability of surviving
- 18:41beyond a certain number of weeks
- 18:44is plotted on the vertical axis.
- 18:47The survival curve hits 50% at 23 weeks,
- 18:51so the median survival time
- 18:53in this group is estimated to be 23 weeks.
- 18:57Under the model that we typically use,
- 19:00the hazard ratio is equal
- 19:01to the ratio of the median survival times in the two groups.
- 19:06For example, if the median survival time
- 19:08in the drug group is twice that seen in the placebo group,
- 19:12the hypothesized hazard ratio would equal one half.
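[Editorial aside: "the model that we typically use" here is the exponential survival model, under which the hazard can be recovered from a reported median survival time. A minimal sketch; the 23-week median comes from the video's example, and the 46-week drug-group median is hypothetical, chosen to match the "twice the median" scenario.]

```python
import math

def hazard_from_median(median):
    # Exponential model: S(t) = exp(-lam * t), so S(median) = 0.5
    # implies lam = ln(2) / median.
    return math.log(2) / median

lam_placebo = hazard_from_median(23)  # median survival of 23 weeks
lam_drug = hazard_from_median(46)     # hypothetical: double the median

hazard_ratio = lam_drug / lam_placebo  # 23 / 46 = 0.5
```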
- 19:18Other important quantities to specify
- 19:21in a survival sample size calculation
- 19:23are the duration of the accrual period
- 19:26and the duration of follow up.
- 19:29These will affect the number of events,
- 19:32since longer studies have a greater opportunity
- 19:35to observe study events.
- 19:39Lastly, I want to discuss an important issue
- 19:43that affects the required sample size,
- 19:45and that is the anticipated proportion of subjects
- 19:49who are lost to follow up.
- 19:50Since these subjects are lost,
- 19:52we'll never observe their endpoint,
- 19:55so we need to compensate for their loss.
- 19:57If the anticipated loss or withdrawal proportion
- 20:01is W, where W is a proportion between zero and one,
- 20:06then the required number of patients per group
- 20:09should be inflated to n adjusted,
- 20:12which equals the originally planned per group sample size,
- 20:16n, divided by one, minus W.
- 20:19The estimated size of W can often be obtained
- 20:23from prior studies.
- 20:25If there's no prior data,
- 20:26then you may want to set W equal to 0.1 or 10%.
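[Editorial aside: the dropout inflation n adjusted = n / (1 − W) is simple enough to sketch directly; the function name and the planned n of 63 in the example are illustrative, not from the video.]

```python
import math

def n_adjusted(n, w):
    """Inflate per-group sample size n for an anticipated
    dropout (withdrawal) proportion w, 0 <= w < 1."""
    if not 0 <= w < 1:
        raise ValueError("w must be in [0, 1)")
    return math.ceil(n / (1 - w))

# Hypothetical example: 63 planned per group, 10% anticipated loss
n_adj = n_adjusted(63, 0.10)  # 63 / 0.9 -> 70 per group
```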
- 20:32One thing to note is that we're assuming
- 20:34that the loss to follow up is occurring at random
- 20:38and it's not related to the health status of the subject.
- 20:42If it's true that, for example, sicker patients
- 20:44are dropping out of the study,
- 20:46then this may bias the results,
- 20:48especially if more of the sicker patients
- 20:50are dropping out of one group than the other.
- 20:54Inflating the sample size for dropouts
- 20:56will not fix a biased study,
- 20:58so it's important to try to minimize dropouts
- 21:01as much as possible.
- 21:04The sample size calculations
- 21:06are an important part of the study design process.
- 21:10The calculations can't be performed
- 21:12by the statistician alone.
- 21:14Input from the investigators and the study team is important
- 21:18when it comes to setting these sample size parameters.
- 21:21So it's my hope that you've come away with an understanding
- 21:24of the different factors that you need to consider
- 21:27and think about during the study planning process.