Elements of Sample Size
February 06, 2023
In this fourth video, we discuss sample size considerations for quantitative, dichotomous, and time-to-event endpoints.
- 00:02[Maria Ciarleglio] My name is Maria Ciarleglio
- 00:04and I'm a faculty member in the Department of Biostatistics
- 00:08at the Yale School of Public Health.
- 00:11In this video series,
- 00:12I will introduce the clinical research process
- 00:15to prepare you to collaborate with a statistician.
- 00:20In this fourth video,
- 00:22we'll discuss elements of sample size determination,
- 00:25which is an important part of study design.
- 00:30In statistics, we apply methods that allow us to use data
- 00:35from a sample to answer a variety of questions.
- 00:39Sample data are used to estimate population parameters,
- 00:43such as population means or population proportions,
- 00:47to develop models relating one or more explanatory variables
- 00:51to a response variable, and to test hypotheses.
- 00:55In all we do, we answer these questions
- 00:57using a representative sample
- 01:00from the population of interest.
- 01:02This leads to the natural question:
- 01:04how large should the sample be?
- 01:07Sample size methods that we'll discuss today
- 01:09are presented with the idea of a parallel-design
- 01:13two-arm randomized clinical trial in mind,
- 01:16but they can also be applied to other designs,
- 01:19such as observational studies.
- 01:21Our goal is to determine the sample size needed
- 01:24to be able to detect a hypothesized difference
- 01:28of clinical interest between the two study arms.
- 01:31If a difference truly exists
- 01:33we want to be able to detect that difference
- 01:36with high probability,
- 01:38and that probability is a term
- 01:40we'll introduce shortly known as statistical power.
- 01:44The sample size calculation depends
- 01:46on the planned hypothesis test,
- 01:48but the planned test depends
- 01:50on the study's primary endpoint.
- 01:53In video three, we reviewed quantitative endpoints,
- 01:57dichotomous endpoints, and time to event endpoints.
- 02:00When the primary endpoint is a continuous measure,
- 02:04such as change in portal pressure within a subject,
- 02:07we summarize the response using the average
- 02:11or mean change in portal pressure in each treatment group.
- 02:15If there's no difference between the treatments on average
- 02:18then the difference in means in group two versus group one,
- 02:23mu two, minus mu one, equals zero.
- 02:26This is called our null hypothesis.
- 02:29Our goal is usually to demonstrate a difference,
- 02:32so we would like to reject the null hypothesis
- 02:36and conclude that a difference exists.
- 02:39The hypothesis of a difference
- 02:41or effect is called the alternative hypothesis,
- 02:45and we'll discuss the alternative hypothesis
- 02:47on the next slide.
- 02:50When the primary endpoint is a dichotomous measure,
- 02:53such as treatment response, yes or no,
- 02:56we summarize the response using the proportion
- 02:59of patients who respond in each treatment group.
- 03:02Again, if there's no difference between the treatments
- 03:05the difference in proportions,
- 03:06P two, minus P one, equals zero under the null hypothesis.
- 03:12When the primary endpoint is time to event,
- 03:15such as time to death or time to relapse,
- 03:17then we often represent the effect
- 03:20in terms of the hazard rate, lambda.
- 03:23If there's no difference between the treatments
- 03:25then under the null hypothesis,
- 03:27the difference in the hazard rates equals zero,
- 03:31or equivalently, the hazard ratio equals one.
- 03:36Ideally, if there's a treatment effect,
- 03:38we reject the null hypothesis
- 03:40and conclude the alternative hypothesis,
- 03:43which states that there is a difference between the two populations.
- 03:46For example, suppose our goal is to determine
- 03:50if there's a difference in the proportion
- 03:51of responders in those on Sorafenib compared to placebo.
- 03:56The alternative hypothesis tested is a two-sided alternative
- 04:00that the difference in the proportion of responders
- 04:02is not equal to zero.
- 04:04When performing two-sided tests,
- 04:07if our test statistic falls
- 04:09into either of the blue rejection regions, shown here,
- 04:13we would reject the null hypothesis of no difference,
- 04:16and conclude that there's a significant difference
- 04:18between treatments.
- 04:21Suppose, instead, we were only interested
- 04:24in a significant conclusion
- 04:26if we showed that the proportion of responders
- 04:29is higher in the treatment group
- 04:31than in the placebo group;
- 04:34this would give a difference in proportions
- 04:36greater than zero.
- 04:37In this case, we would only be interested
- 04:40in effects in the upper tail, shown in red.
- 04:43This one directional test is called a one-sided test
- 04:47or more specifically an upper tail test.
- 04:51Similarly, we may be interested in an effect
- 04:55in the negative direction,
- 04:56in which case we would only look
- 04:58for a significant conclusion in the lower tail.
- 05:02This one-sided test is a lower tail test.
- 05:07The direction of your test
- 05:08affects your sample size calculation.
- 05:11We will talk about this alpha symbol shortly,
- 05:14but it's called the significance level,
- 05:16or the type one error of your test.
- 05:19It's the probability
- 05:20of incorrectly rejecting the null hypothesis.
- 05:25In one-sided tests,
- 05:26since we're only looking in one direction
- 05:29for evidence against the null hypothesis,
- 05:31all of our type one error is in a single tail.
- 05:35However, with two-sided tests, because it's possible
- 05:38for us to reject the null hypothesis,
- 05:41if there is extreme evidence in either tail
- 05:44we split our type one error between the two tails.
- 05:48When looking at each tail
- 05:50we actually require stronger evidence
- 05:53against the null to reject in the case of a two-sided test.
- 05:58Since it's more difficult to reject the null hypothesis
- 06:01we will need a larger sample size
- 06:03when performing a two-sided test compared
- 06:06to a one-sided test.
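[Editorial aside: the one- versus two-sided critical values discussed above can be computed with Python's standard library. A minimal sketch; the variable names are illustrative, not from the video.]

```python
from statistics import NormalDist

z = NormalDist().inv_cdf  # standard normal quantile function

alpha = 0.05
z_one_sided = z(1 - alpha)      # all of alpha in one tail: ~1.645
z_two_sided = z(1 - alpha / 2)  # alpha split across both tails: ~1.960
```

Because the two-sided critical value is larger, stronger evidence is needed to reject, which is why the two-sided test requires the larger sample size.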
- 06:09We recommend performing two-sided tests.
- 06:13Although you might expect a new treatment
- 06:16to demonstrate superiority over the control treatment,
- 06:20it's always good to have the option to formally
- 06:23reject the null hypothesis
- 06:24if an effect is seen in the opposite direction.
- 06:29There are several key factors
- 06:31that affect the required sample size.
- 06:33The hypothesized treatment difference, delta,
- 06:36the variability or noise in the endpoint measurement, sigma,
- 06:40the level of statistical significance, alpha,
- 06:43and the level of statistical power, one minus beta.
- 06:48We'll discuss each of these components,
- 06:50starting with the expected clinical difference
- 06:53between the two treatments being tested.
- 06:56In order to estimate sample size, you must first specify
- 07:00the magnitude of the difference you wish to detect.
- 07:04We denote this difference as delta.
- 07:07Sample size is calculated
- 07:08under a specific alternative hypothesis,
- 07:11that the difference in your parameters,
- 07:14the difference in means here, is equal to delta.
- 07:18The blue curve shows us the distribution
- 07:21of the difference in means under the null hypothesis.
- 07:24Under the null, the distribution is centered at zero,
- 07:28assuming no difference on average between the treatments.
- 07:33Under the alternative,
- 07:34for the purpose of sample size calculations,
- 07:37the distribution of the difference in means
- 07:39is centered at delta and is represented by the red curve.
- 07:43The more different the two distributions are assumed to be,
- 07:47the larger delta,
- 07:49and the less overlap we see between the two distributions.
- 07:53Smaller differences are more difficult to detect
- 07:57because the distributions are closer together,
- 07:59and, as a result, we require a larger sample size
- 08:03to be able to detect the small difference
- 08:05and distinguish that difference from random variation.
- 08:09Larger hypothesized differences
- 08:11require smaller sample sizes.
- 08:14How do you choose a value for delta?
- 08:16Sometimes there's prior knowledge
- 08:18that allows an investigator to anticipate
- 08:21the treatment benefit that's likely to be observed,
- 08:24and the role of the study is to confirm that expectation.
- 08:27Other times, delta's taken to equal the smallest
- 08:30or minimum clinically relevant difference
- 08:33that would warrant adopting the new treatment.
- 08:37Investigators are often optimistic
- 08:40about the effect of a new treatment,
- 08:42and that's understandable,
- 08:43but I recommend you not be overly optimistic.
- 08:47If the treatment effect is not as large as expected,
- 08:51you could end up with a null or negative trial,
- 08:54which is a trial that does not show
- 08:56a significant difference.
- 08:58There may actually be a true and worthwhile
- 09:02treatment benefit that's been missed
- 09:04because the difference was mis-specified
- 09:06or hypothesized to be too large.
- 09:09This is why a lot of thought
- 09:11needs to go into the study design
- 09:14and what is considered meaningful.
- 09:18The next element involved in sample size determination
- 09:21is the standard deviation of the primary endpoint.
- 09:24The standard deviation is denoted by sigma,
- 09:27and needs to be specified
- 09:29when we're dealing with a continuous primary endpoint.
- 09:33In this figure,
- 09:34there are actually four normal distributions plotted.
- 09:38Let's begin with the solid blue curve
- 09:41and the solid red curve.
- 09:43These two curves have the same standard deviation.
- 09:46Their standard deviation is larger
- 09:49than the standard deviation
- 09:50of the dashed blue curve and the dashed red curve.
- 09:54As sigma decreases,
- 09:55there's less overlap between the two distributions.
- 09:59More noise or higher variability makes it more
- 10:03difficult to detect differences
- 10:05and requires a larger sample size.
- 10:09One thing to note is that the treatment difference,
- 10:12delta, is sometimes standardized
- 10:14and presented as an effect size
- 10:17denoted here by capital delta.
- 10:19This is simply little delta divided by sigma.
- 10:24There are two errors that we can make
- 10:27when we perform a hypothesis test,
- 10:29and both of them influence sample size.
- 10:31We fix these errors
- 10:33at levels that we believe to be acceptable,
- 10:35and they're usually set to relatively small values.
- 10:39The first error we'll discuss is type one error
- 10:42or the alpha level of the test.
- 10:45The blue curve is, again,
- 10:46the distribution of the difference in means
- 10:49under the null hypothesis.
- 10:51The red curve is the distribution
- 10:54under the specific alternative hypothesis
- 10:57that assumes the treatment effect is equal to delta.
- 11:01Hypothesis testing is performed
- 11:03assuming the null hypothesis is true.
- 11:06That is, assuming the blue curve is true.
- 11:09The green shaded areas in the tails of the blue curve
- 11:13mark extreme values that aren't likely to be observed
- 11:16if the difference in means is equal to zero,
- 11:20that is if the null hypothesis is true.
- 11:22If we observe a result in the green shaded area
- 11:26then we'll reject the null hypothesis
- 11:29and conclude the alternative hypothesis.
- 11:32This is equivalent to observing a P value
- 11:34of the test less than or equal to alpha.
- 11:38However, if the null hypothesis is true,
- 11:41then we're committing an error
- 11:43by concluding there is an effect,
- 11:45there is a difference, when in fact there isn't one.
- 11:49This is called a type one error.
- 11:51The smaller you make the green shaded area,
- 11:54the less likely you will incorrectly reject
- 11:57the null hypothesis,
- 11:59because you're going to require
- 12:00greater and greater evidence to do so.
- 12:03We typically set alpha equal to 0.05
- 12:06because it's felt that a 5% chance
- 12:09of falsely rejecting the null hypothesis is acceptable.
- 12:14Choosing a smaller alpha will increase your protection
- 12:18against committing a type one error,
- 12:20but there's a trade off
- 12:21in that it will be more difficult for you to conclude
- 12:24there's a difference, even when there is one.
- 12:27Decreasing alpha will increase the required sample size.
- 12:33The second error is called type two error,
- 12:36and it's denoted beta,
- 12:39the gray shaded region in this figure.
- 12:43We do not reject the null hypothesis
- 12:45if the difference we observe falls in the gray region.
- 12:49We only reject the null if the difference observed
- 12:52falls in either of the green regions,
- 12:55the rejection region of the test.
- 12:57However, because the two distributions overlap
- 13:00there is this gray shaded region
- 13:02where the alternative hypothesis is true,
- 13:05the red curve is true, but we fail to reject the null
- 13:08because we don't observe an effect that's extreme enough.
- 13:12When this occurs, we are committing a type two error.
- 13:16Of course, we want the type two error to be low,
- 13:19but rather than set beta, we usually set one minus beta,
- 13:23which is called the statistical power of the test.
- 13:27This is represented by the purple shaded area.
- 13:31Power is the probability of rejecting the null hypothesis
- 13:35when we should.
- 13:36That is, rejecting a false null hypothesis,
- 13:40and we want this probability to be high.
- 13:43We typically set power to be at least 80%.
- 13:47Larger power will require a larger sample size
- 13:51to increase our chance of detecting a true difference.
- 13:56If you work with all of these ideas in their equation form
- 14:00you can derive a fundamental sample size equation
- 14:04that relates all four of these parameters
- 14:06to the sample size required in each treatment group.
- 14:10This equation shown here assumes a continuous
- 14:13primary outcome variable,
- 14:14but the relationships are the same for any outcome,
- 14:17including binary and time to event.
- 14:20We see sigma and delta.
- 14:23Delta is the treatment effect
- 14:25or the difference in group means.
- 14:27As I mentioned before,
- 14:29you can divide the difference in means, delta,
- 14:32by the common standard deviation, sigma,
- 14:35to write the equation as a function
- 14:36of the standardized effect size.
- 14:39As sigma increases,
- 14:41it's clear that the sample size will increase.
- 14:43This is because the data are more noisy,
- 14:46more heterogeneous, and it's more difficult
- 14:49to detect a signal when this is the case
- 14:52and we need a larger sample size.
- 14:54Delta is in the denominator, so as delta decreases,
- 14:58the sample size will increase.
- 15:00When delta is small, there will be more overlap
- 15:04between the two distributions
- 15:05and it will be more difficult to detect a difference,
- 15:09so we need a larger sample size
- 15:11to detect smaller differences.
- 15:14All of these relationships make sense
- 15:15if you talk them through
- 15:17and they're supported by the equations
- 15:20used to perform the sample size calculations.
- 15:23In terms of alpha and beta,
- 15:25our type one and type two errors,
- 15:28they are here in the numerator
- 15:30but they're represented by their corresponding Z values.
- 15:34Smaller alpha and beta errors
- 15:36correspond to larger Z values and larger sample sizes.
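[Editorial aside: the fundamental equation referred to on the slide is the standard two-sample formula, n per group = 2(z₁₋α/₂ + z₁₋β)²σ²/δ² for a two-sided test. A minimal sketch in Python using only the standard library; the function name and example numbers are hypothetical, not from the video.]

```python
import math
from statistics import NormalDist

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Per-group n for a two-sided, two-sample comparison of means."""
    z = NormalDist().inv_cdf
    # n = 2 * (z_{1-alpha/2} + z_{1-beta})^2 * (sigma / delta)^2
    n = 2 * (z(1 - alpha / 2) + z(power)) ** 2 * (sigma / delta) ** 2
    return math.ceil(n)

# Hypothetical example: detect a 5-unit mean difference with SD 10
# (a standardized effect of 0.5) at alpha = 0.05 and 80% power.
n = n_per_group(delta=5, sigma=10)  # 63 patients per group
```

Halving delta (or doubling sigma) quadruples the required n, which is the relationship described above: smaller differences and noisier data both demand larger samples.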
- 15:42We'll wrap up this video by going
- 15:44through the three common endpoint types
- 15:47and discussing the elements of sample size determination
- 15:50that you need to define for the sample size calculation
- 15:54in each case.
- 15:55You'll need to specify the type one error level, alpha,
- 15:59and the direction of the alternative hypothesis.
- 16:02That is, are you performing a one-sided
- 16:04or a two-sided hypothesis test?
- 16:07You'll also need to specify the level
- 16:09of statistical power of the test.
- 16:12When your primary endpoint is a continuous variable,
- 16:16such as change in portal pressure,
- 16:18you'll need to specify delta,
- 16:20the magnitude of the hypothesized difference
- 16:23in mean portal pressure change in the two treatment groups,
- 16:27and sigma, the standard deviation
- 16:29of the change in portal pressure.
- 16:31We often assume that the variability
- 16:34of the response is the same in both arms,
- 17:37but the sample size calculations
- 17:38can accommodate unequal standard deviations
- 16:41in each population.
- 16:44Again, we can specify the difference
- 16:45as a standardized effect size.
- 16:48Cohen suggested values of the effect size that correspond
- 16:52to small, moderate, and large effects.
- 16:55A small effect is estimated at 0.2,
- 16:59a moderate effect is 0.5, and a large effect is 0.8.
- 17:05Again, as delta decreases,
- 17:07the sample size necessary to detect that effect increases.
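[Editorial aside: plugging Cohen's three standardized effect sizes into the formula above shows how quickly n grows as the effect shrinks. A sketch under the same assumptions as before (two-sided alpha of 0.05, 80% power).]

```python
import math
from statistics import NormalDist

z = NormalDist().inv_cdf
alpha, power = 0.05, 0.80
k = 2 * (z(1 - alpha / 2) + z(power)) ** 2  # ~15.7

# Per-group n for Cohen's small / moderate / large standardized effects
sizes = {effect: math.ceil(k / effect ** 2) for effect in (0.2, 0.5, 0.8)}
# small (0.2) -> 393, moderate (0.5) -> 63, large (0.8) -> 25 per group
```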
- 17:13When your primary endpoint is a binary variable,
- 17:17such as development of surgical site infection,
- 17:19we summarize the response using the proportion of responders
- 17:23in each treatment group.
- 17:25The anticipated effects between groups can be expressed
- 17:28as the difference in the two proportions,
- 17:31P two, minus P one,
- 17:33so you would need to specify
- 17:35the hypothesized proportion of responders
- 17:38in each group for the sample size calculation.
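[Editorial aside: a minimal sketch of one common normal-approximation formula for two proportions. The function name and example proportions are hypothetical; other formulas, such as pooled-variance or continuity-corrected versions, give slightly different answers.]

```python
import math
from statistics import NormalDist

def n_per_group_props(p1, p2, alpha=0.05, power=0.80):
    """Per-group n for a two-sided comparison of two proportions
    (simple unpooled normal-approximation formula)."""
    z = NormalDist().inv_cdf
    za, zb = z(1 - alpha / 2), z(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((za + zb) ** 2 * variance / (p2 - p1) ** 2)

# Hypothetical example: 30% response on placebo vs 50% on treatment
n = n_per_group_props(0.30, 0.50)  # 91 patients per group
```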
- 17:42Finally, when your primary endpoint is a survival
- 17:46or time to event endpoint, such as time to death,
- 17:49or time to progression,
- 17:50the anticipated effect size between groups
- 17:53is usually in the form of a difference in hazard rates,
- 17:57lambda two, minus lambda one, or a hazard ratio.
- 18:02You would need to specify the hypothesized hazards
- 18:05in each group, or the hypothesized hazard ratio.
- 18:09For example, if the intervention reduces the mortality rate
- 18:13by 20%, the hazard ratio would equal 0.8.
- 18:19You may have prior data
- 18:20on a quantity called median survival time.
- 18:23This is often reported in the literature.
- 18:27The median survival time is the time point
- 18:29when we expect the survival probability to equal 50%.
- 18:35In the sample data,
- 18:37the estimated survival probability
- 18:39or probability of surviving
- 18:41beyond a certain number of weeks
- 18:44is plotted on the vertical axis.
- 18:47The survival curve hits 50% at 23 weeks,
- 18:51so the median survival time
- 18:53in this group is estimated to be 23 weeks.
- 18:57Under the model that we typically use,
- 19:00the hazard ratio is equal
- 19:01to the ratio of the median survival times in the two groups.
- 19:06For example, if the median survival time
- 19:08in the drug group is twice that seen in the placebo group,
- 19:12the hypothesized hazard ratio would equal one half.
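[Editorial aside: "the model that we typically use" here is the exponential survival model, under which the hazard can be recovered from a reported median survival time. A minimal sketch; the 23-week median comes from the video's example, and the 46-week drug-group median is hypothetical, chosen to match the "twice the median" scenario.]

```python
import math

def hazard_from_median(median):
    # Exponential model: S(t) = exp(-lam * t), so S(median) = 0.5
    # implies lam = ln(2) / median.
    return math.log(2) / median

lam_placebo = hazard_from_median(23)  # median survival of 23 weeks
lam_drug = hazard_from_median(46)     # hypothetical: double the median

hazard_ratio = lam_drug / lam_placebo  # 23 / 46 = 0.5
```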
- 19:18Other important quantities to specify
- 19:21in a survival sample size calculation
- 19:23are the duration of the accrual period
- 19:26and the duration of follow up.
- 19:29These will affect the number of events,
- 19:32since longer studies have a greater opportunity
- 19:35to observe study events.
- 19:39Lastly, I want to discuss an important issue
- 19:43that affects the required sample size,
- 19:45and that is the anticipated proportion of subjects
- 19:49who are lost to follow up.
- 19:50Since these subjects are lost,
- 19:52we'll never observe their endpoint,
- 19:55so we need to compensate for their loss.
- 19:57If the anticipated loss or withdrawal proportion
- 20:01is W, where W is a proportion between zero and one,
- 20:06then the required number of patients per group
- 20:09should be inflated to n adjusted,
- 20:12which equals the originally planned per group sample size,
- 20:16n, divided by one, minus W.
- 20:19The estimated size of W can often be obtained
- 20:23from prior studies.
- 20:25If there's no prior data,
- 20:26then you may want to set W equal to 0.1 or 10%.
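[Editorial aside: the dropout inflation n adjusted = n / (1 − W) is simple enough to sketch directly; the function name and the planned n of 63 in the example are illustrative, not from the video.]

```python
import math

def n_adjusted(n, w):
    """Inflate per-group sample size n for an anticipated
    dropout (withdrawal) proportion w, 0 <= w < 1."""
    if not 0 <= w < 1:
        raise ValueError("w must be in [0, 1)")
    return math.ceil(n / (1 - w))

# Hypothetical example: 63 planned per group, 10% anticipated loss
n_adj = n_adjusted(63, 0.10)  # 63 / 0.9 -> 70 per group
```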
- 20:32One thing to note is that we're assuming
- 20:34that the loss to follow up is occurring at random
- 20:38and it's not related to the health status of the subject.
- 20:42If it's true that, for example, sicker patients
- 20:44are dropping out of the study,
- 20:46then this may bias the results,
- 20:48especially if more of the sicker patients
- 20:50are dropping out of one group than the other.
- 20:54Inflating the sample size for dropouts
- 20:56will not fix a biased study,
- 20:58so it's important to try to minimize dropouts
- 21:01as much as possible.
- 21:04The sample size calculations
- 21:06are an important part of the study design process.
- 21:10The calculations can't be performed
- 21:12by the statistician alone.
- 21:14Input from the investigators and the study team is important
- 21:18when it comes to setting these sample size parameters.
- 21:21So it's my hope that you've come away with an understanding
- 21:24of the different factors that you need to consider
- 21:27and think about during the study planning process.