Analysis and Interpretation of single cells sequencing data – part 4 Multiple Test Correction

Name: Analysis and Interpretation of single cells sequencing data – part 4 Multiple Test Correction
Uploaded: 2021-08-25T15:10:48.03Z
Duration: 16 min 49 s

August 25, 2021

Information

ID: 6876
To Cite: DCA Citation Guide

00:03OK, so today's talk will be mainly
00:06about the analysis of single-cell
00:08RNA-seq data.
00:10Last time, or rather the last three
00:13times, we dealt with the analysis
00:15of bulk RNA-seq, the classic methods, and
00:18pathway enrichment analysis methods.
00:20I have some slides because last
00:23time, I remember, there were
00:25more specific questions about
00:27multiple test correction, and so I
00:30have these slides to at least
00:34try to explain two of the methods
00:37that are used to do that.
00:40So here.
00:41So we start with the assumption that
00:44we are bound, in our statistical
00:47analysis, to the P value threshold of 0.05.
00:50So whenever we run a test on
00:53whatever we want to test,
00:56it seems that the most important
01:00thing is that you have a P value of
01:04less than 0.05.
01:08So a P value threshold of 0.05 means that
01:11the probability of rejecting
01:13your null hypothesis when it's true is 5%.
01:17So for example,
01:19when we are comparing gene expression,
01:21we're performing a test of
01:24differential expression of one
01:25gene between condition A and B.
01:27It means that the null
01:30hypothesis is that the gene is
01:33not differentially expressed.
01:35So if we have a P value below 0.05,
01:37it means that we can reject the null
01:40hypothesis that the gene is not changing.
01:43Yeah,
01:44and the probability of getting such a result
01:46when the null hypothesis is true is only 5%,
01:48so that's why the lower the P value,
01:50the more confident we are in
01:54rejecting the null hypothesis.
01:56One of the limitations of this
01:57is that the P value doesn't say
02:00anything about our own hypothesis,
02:01that is, that the gene is really
02:04differentially expressed.
02:06So when we run analyses in a high-throughput way,
02:10we usually have multiple comparisons.
02:12So it means that we run a test of
02:15differential
02:16expression for each gene, and we
02:19usually have 10,000 to 20,000 genes,
02:21for example, in our NGS experiments.
02:24Or when we do pathway enrichment analysis,
02:27we're testing the enrichment of
02:29hundreds or thousands of pathways.
02:31That means that we run a lot of tests,
02:34and for this reason the probability
02:37of having false positives, which is 5% for a single test,
02:40that is, of saying that the gene is
02:43differentially expressed
02:44when in reality it is not, rises with
02:47the number of tests that we run.
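As a rough illustration of why the chance of false positives rises with the number of tests, here is a minimal Python sketch. It assumes that the tests are independent and that every null hypothesis is true, neither of which is stated in the talk, and the function name is hypothetical.

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance of one or more false positives across n_tests tests.

    Assumption (not from the talk): the tests are independent and
    every null hypothesis is true.
    """
    return 1.0 - (1.0 - alpha) ** n_tests


# The probability grows quickly as more tests are run.
for n in (1, 10, 100, 10_000):
    print(n, round(prob_at_least_one_false_positive(n), 4))
```

With a single test the value is the familiar 5%; with hundreds or thousands of tests it approaches 1 under these assumptions.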
02:50And so, for example,
02:52here you see a numeric example,
02:54so assume that you run
02:57a differential expression analysis
03:00on next generation sequencing data.
03:03You have
03:04a collection of data on
03:0710,000 genes and you select 1,000 genes,
03:10so 10%, with a P value that is less than 0.05.
03:15Now,
03:16since you ran 10,000 tests and you
03:20accepted a 5% probability of making an error,
03:23that means that you expect, by chance,
03:26to have 5% of tests that
03:31are significant but not true.
03:37So we expect 500 out of the 10,000 genes to
03:42be picked up by chance.
03:45So 500
03:48is the number of genes we expect
03:50to be in our list of differentially
03:52expressed genes when they are not,
03:54and it's 50% of the 1,000 that we find.
03:59So if we calculate 500 out of 1,000,
04:04that means that we have 50% of genes
04:07that we expect to be false positives.
04:10So this ratio is what we
04:13call the false discovery rate,
04:15and in this case, using just these
04:18values, it is 50%. So it's very high.
04:21It means that potentially one out of
04:23two genes can be a false positive.
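The arithmetic of this numeric example can be written out as a short Python sketch. The variable names are made up for illustration, and the estimate follows the speaker's simplification that the 5% error rate applies to all 10,000 tests.

```python
n_tests = 10_000     # genes tested
alpha = 0.05         # P value threshold
n_selected = 1_000   # genes with a P value below the threshold

# Tests expected to be significant by chance alone
expected_false_positives = n_tests * alpha

# Share of the selected genes expected to be false positives
estimated_fdr = expected_false_positives / n_selected

print(expected_false_positives, estimated_fdr)
```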
04:26So multiple test correction methods are
04:28ways to modify the original P values,
04:32in particular to increase the original P
04:35values, so that this probability of having
04:38false positives is ultimately
04:40reduced.
04:45Yeah, I just want to make a quick
04:47addition, as this is a very important
04:50concept. The first time I
04:52encountered the term null hypothesis,
04:55I got confused initially as well:
04:57what the heck is a null hypothesis?
04:59The null hypothesis we can consider as follows. For
05:01example, if you're comparing two groups
05:03of samples, the null hypothesis is basically that
05:06the one gene
05:08you're looking at has no difference
05:09between these two groups, for example.
05:11So this would be the null hypothesis, and the
05:135% P value threshold, 0.05,
05:18means you have a 5% chance
05:21of making a wrong call, basically.
05:24So that's eventually how
05:25I interpreted it myself.
05:28Yeah, exactly. So
05:29a false positive is a wrong call.
05:31So basically yes, and when you do
05:35pathway enrichment, the null hypothesis
05:37is that the pathway is not enriched.
05:40So those are the two examples that I'm using.
05:43Uhm, so all the multiple test
05:46correction methods increase the
05:48original P values with the aim
05:50of reducing these false positives.
05:53What they do not do is
05:55swap the order of the P values.
05:57So you can imagine, if you have your
05:59list of genes and you rank your genes
06:01according to the original P values,
06:03when you perform the transformation
06:05or the correction,
06:06you don't have changes in the
06:09rank of the genes.
06:11So the simplest correction is
06:14the Bonferroni correction method.
06:16It's simple because you just take
06:18the original P value and you
06:20multiply the original P value by
06:22the number of tests that you perform.
06:24So in our case, for example,
06:26if you run tests for 10,000 genes,
06:28that means you take your original
06:30P values and you multiply each
06:32of these P values by 10,000.
06:34So this is the formula.
06:36I just
06:37have a question. I'm
06:38understanding what you're saying,
06:39but there it says adjusted P minus value.
06:43I know it's a P value.
06:44Is it an equation? No, no.
06:45It's P-value with a very large space.
06:48Minus, yeah. Oh yeah, yeah.
06:53OK, it's P value, so
06:56I will correct this. So it's just P value.
06:59Yeah, it's the original P value
07:01multiplied by the number of tests.
07:04OK, thank you. And this belongs to a family
07:08of correction methods that control
07:11the so-called family-wise error rate.
07:14So after the correction, when you
07:16take everything that is below 0.05,
07:19the interpretation is that you're
07:23allowing a 5% probability of having at least
07:26one false positive.
07:28So you're controlling the
07:30probability of having in your final
07:32list at least one false positive.
07:34That's why it's a very
07:37conservative correction,
07:38so it's very stringent, because
07:39basically you are allowing almost
07:42no false positives at all.
07:44And so,
07:45especially when you have a
07:47large number of tests,
07:48since this is the number that you
07:50multiply your original P values by,
07:51it can be very unrewarding,
07:55meaning that after the correction
07:56all the P values
07:59basically end up at 1,
08:01so that no gene is selected after
08:03the correction.
08:04So that's why this is simple to explain,
08:06but it's rarely used.
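The Bonferroni formula described above can be sketched in a few lines of Python. The function name and the example P values are hypothetical and are not the values from the slides. Capping the result at 1 is the usual convention, since a probability cannot exceed 1.

```python
def bonferroni_adjust(p_values):
    """Multiply each P value by the number of tests, capping at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# Hypothetical P values for five genes (illustration only)
original = [0.0001, 0.004, 0.02, 0.04, 0.9]
adjusted = bonferroni_adjust(original)

# Genes that still pass the 0.05 threshold after the correction
selected = [i for i, p in enumerate(adjusted) if p < 0.05]
print(adjusted, selected)
```

Because every value is multiplied by the same number, the order of the genes does not change, and with thousands of tests most adjusted values hit the cap of 1.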
08:08The most common method that is used
08:11is the Benjamini-Hochberg correction.
08:13So this is the most popular
08:15multiple test correction method.
08:17It was introduced about 25 years ago,
08:19and it belongs to a family
08:21of methods that are designed to
08:23control the false discovery rate,
08:25so the proportion of false positives that we
08:30expect in our data.
08:32So if we select,
08:35after the correction,
08:37everything below 0.05, it means that we allow 5% of our selected genes
08:42to be wrong calls, or false positives.
08:45So it's a stepwise method,
08:47so it's slightly
08:49more complex
08:51than the Bonferroni,
08:53but the formula is quite straightforward.
08:55It requires first that you sort
08:59all your original P values in increasing order,
09:02so from the smallest to the biggest value,
09:06so that you can calculate the rank.
09:09So the smallest
09:11has rank one, and then two, three, and so on.
09:14So the adjusted
09:16P value is basically the original P
09:19value multiplied by the number of tests
09:22and divided by the rank of that P value.
09:25So this means that you don't
09:27multiply your original value by a
09:30fixed number, as in the Bonferroni,
09:32but the amount
09:35of the multiplication depends on
09:37the rank of your original P value.
09:40And I have this example to show the
09:43difference between the two approaches.
09:45So it's a simplified example.
09:48We run an analysis.
09:49Let's assume we run an analysis
09:52on 53 genes.
09:55So these are our genes. We're
09:57testing for differential expression
09:59of these genes in two conditions,
10:00so these are the original P
10:02values that we get,
10:04for example from, let's say, a t-test
10:07or something that is more specific
10:09for next generation sequencing data.
10:11So it's
10:12any test of differential expression.
10:15So this is the original P value, and
10:18I ranked the genes by increasing
10:20values, so that this first gene
10:23is the most significant,
10:25and so on. This, the last one,
10:28has no significance at all,
10:29because the P value is 1.
10:32So this is the Bonferroni formula.
10:33So since we have 53
10:36genes, we run 53 tests, and so every
10:39number has to be multiplied by 53.
10:42So after applying this formula,
10:44that's the result that we get.
10:47And so, if before the correction all these
10:50genes were below the 0.05 threshold,
10:53after the correction
10:55only the first gene is selected,
10:59because it's the only one that is below 0.05.
11:03Below here you see the same analysis,
11:06but with the Benjamini-Hochberg correction.
11:10So the original P values are the same.
11:13It's a stepwise procedure because
11:16you start from the bottom.
11:19And so you multiply this
11:22value, which is one, by the number
11:24of tests divided by the rank.
11:26So this is multiplied by one,
11:28so it stays the same,
11:29and that's why it's one
11:31also here. Then you multiply this
11:34value by the number of tests,
11:3653, divided by the rank.
11:38So this factor is slightly more than one,
11:40so you're increasing a little bit
11:44the value here. This is the
11:47result, and what I didn't tell you before
11:50is that it's not simply this formula,
11:52but once you get this result,
11:54you have to check whether it is
11:57higher than the corrected
12:00P value of the gene that precedes, that is, the one you processed just before.
12:03In this case it's lower,
12:04so we keep this.
12:05But if it were higher, then we
12:08would have kept the value of the preceding gene,
12:11and you see this here.
12:14I proceed, and so we multiply this value
12:17here by 53 / 3, and this is the result.
12:21Now, for this gene here,
12:23the multiplication would give you 0.04.
12:26This is higher than what you obtained here,
12:300.035, and so that's why
12:34the result will not be the
12:36exact result of this formula,
12:38but this gene will take the
12:41value of the gene that precedes.
12:44And so that's why the final
12:46adjusted P value will be 0.035,
12:48and that's why, when you use this method,
12:53I mean,
12:54you will notice you can have a lot of
12:56adjusted values that are the same.
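The stepwise procedure just described, starting from the largest P value and never letting an adjusted value exceed the one computed just before it, can be sketched in plain Python. The function name and the example P values are hypothetical and are not the values on the slide. The running minimum starts at 1, so adjusted values are also capped at 1.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    n_tests = len(p_values)
    # Indices sorted by increasing P value: position 0 holds rank 1.
    order = sorted(range(n_tests), key=lambda i: p_values[i])
    adjusted = [0.0] * n_tests
    running_min = 1.0
    # Walk from the largest P value (highest rank) down to the smallest.
    for rank in range(n_tests, 0, -1):
        i = order[rank - 1]
        candidate = p_values[i] * n_tests / rank
        # Keep the smaller of the formula result and the previous adjusted value.
        running_min = min(running_min, candidate)
        adjusted[i] = running_min
    return adjusted


# Hypothetical P values (illustration only)
print(bh_adjust([0.001, 0.03, 0.031]))
```

In this small example the formula alone would give the second gene 0.03 × 3 / 2 = 0.045, which is higher than the value already obtained for the third gene, so it takes that value instead. This is why tied adjusted values are common with this method.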
13:00Can I ask a question? This is great.
13:02I so appreciate your clarifying
13:05everything. Just on the bottom,
13:07where it says BH adjusted P value
13:10and then in parentheses FDR,
13:12Q value: can you clarify all those
13:16different things? FDR,
13:17I know, is false discovery rate, and Q,
13:19but this 0.027,
13:22could it be referred to as the P value,
13:25the FDR and the Q?
13:27So, uh, yeah, this is a
13:29little bit of terminology.
13:30So this notation here tells you how
13:33the P value has been adjusted, so
13:35it has been adjusted with
13:38the Benjamini-Hochberg correction.
13:40So since this method belongs to
13:42a family of methods that are
13:45so-called false discovery rate methods,
13:47you can interpret the
13:50result also as a false discovery rate.
13:53So that's why sometimes you will not find
13:57BH adjusted P value,
13:58but false discovery rate, and
14:00also an adjusted P value,
14:02I think, can be also called the Q value.
14:05Got it, thank you.
14:06So that means that FDR can be used also
14:09with other correction methods
14:11that belong to the same family
14:13but are not Benjamini-Hochberg?
14:16Yes, Benjamini-Hochberg corrected.
14:19So usually in a publication you use
14:22FDR, for example, and then you specify
14:25in the methods that you used the
14:27Benjamini-Hochberg correction in order to
14:30calculate the FDR.
14:32But sometimes it's left ambiguous. Most
14:34of the time it will be the Benjamini-Hochberg,
14:37but in any case. Perfect, thank you.
14:41And, uh, yeah.
14:42And finally the first gene: as you see,
14:45only the first gene has the same
14:46correction as the Bonferroni,
14:48because this is the only case where
14:50this multiplication, since the
14:52rank is one, corresponds exactly
14:54to the Bonferroni formula,
14:56unless the value of this is higher
14:58than the value of the second gene.
15:01Because remember,
15:02in this case you
15:04take the smaller of the two values,
15:06the result of the formula or the corrected
15:09value of the gene that precedes.
15:12And as you see, in this case,
15:14after you apply the Benjamini-Hochberg
15:15correction,
15:17four of the genes are selected because
15:21the adjusted P value is below 0.05.
15:24So this is an example showing also
15:26that the Bonferroni is
15:28much more stringent than the Benjamini-Hochberg,
15:31because here you accept
15:335% of false positives,
15:35and here you accept a 5% probability
15:38of having at least one false positive,
15:39and that's the difference
15:42in the interpretation.
15:44Now,
15:44one thing that I'd mention
15:46on the nomenclature: it is truly a big
15:48issue in the scientific literature,
15:50because different people use different
15:52ways to refer to these things.
15:54For example, in some papers you
15:56will see the original P value,
15:57which is what Thomas listed
15:59in the third column
16:00here as P value.
16:01Some people refer to
16:03this as a nominal P value,
16:04and some people just refer to it directly as
16:07P value. And for the adjusted P value,
16:10some people refer to it as FDR,
16:12some people refer to it as a Q value,
16:14some people refer to it, like
16:16Thomas said, as a BH adjusted P value.
16:19Some people will even just tell
16:20you that it's an FDR adjusted P value.
16:23So there are many different
16:25nomenclatures for basically
16:27the same things, and different
16:29authors use different ways
16:30to refer to those things.
16:32Yeah, yeah.
16:34Yeah, I think there
16:35is a lot of redundancy,
16:36let's say, now in the terminology,
16:39and no specific rules.
16:42That depends mainly on the reviewers.
16:45I see. OK, so this was
16:49an introduction.