Analysis and Interpretation of single cells sequencing data – part 4 Multiple Test Correction
August 25, 2021
Information
- ID
- 6876
- To Cite
- DCA Citation Guide
Transcript
- 00:03OK, so today's talk will be mainly
- 00:06about the analysis of single-cell
- 00:08RNA-seq data. So far,
- 00:10the last three
- 00:13times, we dealt with the analysis
- 00:15of bulk RNA-seq, classic methods and
- 00:18pathway enrichment analysis methods.
- 00:20I have some slides because last
- 00:23time, I remember, there were
- 00:25more specific questions about
- 00:27multiple test correction, and so I
- 00:30have these slides to at least
- 00:34try to explain two of the methods
- 00:37that are used in order to do that.
- 00:40So here.
- 00:41So we start with the assumption that
- 00:44we are bound in our statistical
- 00:47analysis to that P value threshold of 0.05.
- 00:50So whenever we run a test on
- 00:53whatever we want to test,
- 00:56it seems that the most important
- 01:00thing is that you have a P value of
- 01:04less than 0.05.
- 01:08So a P value of 0.05 means that
- 01:11the probability of rejecting
- 01:13your null hypothesis when it's true is 5%.
- 01:17So for example,
- 01:19when we are comparing gene expression,
- 01:21we're performing a test of
- 01:24differential expression of 1
- 01:25gene between condition A and B.
- 01:27It means that the null
- 01:30hypothesis is that the gene is
- 01:33not differentially expressed.
- 01:35So if we have a P value of 0.05,
- 01:37it means that we can reject the null
- 01:40hypothesis that the gene is not changing.
- 01:43Yeah,
- 01:44and the probability of rejecting that
- 01:46null hypothesis when it is true is only 5%,
- 01:48so that's why the lower the P value,
- 01:50the more confident we are in
- 01:54rejecting the null hypothesis.
- 01:56One of the limitations of this
- 01:57is that the P value doesn't say
- 02:00anything about our hypothesis,
- 02:01that is, that the gene is really
- 02:04differentially expressed.
- 02:06So when we run analyses in a high-throughput way,
- 02:10we usually have multiple comparisons.
- 02:12So it means that we run a test of
- 02:15differential expression
- 02:16for each gene, and we
- 02:19usually have 10,000 to 20,000 genes,
- 02:21for example, in our NGS experiments.
- 02:24Or when we do pathway enrichment analysis,
- 02:27we're testing the enrichment of
- 02:29hundreds or thousands of pathways.
- 02:31That means that we run a lot of tests,
- 02:34and for this reason the probability,
- 02:37that 5%, of having false positives,
- 02:40so of saying that the gene is
- 02:43differentially expressed
- 02:44when in reality it is not, rises with
- 02:47the number of tests that we run.
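How quickly this probability grows can be sketched numerically. The snippet below is only an illustration and was not shown in the talk: it assumes that all tests are independent and that every null hypothesis is true, and the function name `prob_at_least_one_false_positive` is made up for this example.

```python
def prob_at_least_one_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of one or more false positives among n_tests independent tests
    in which every null hypothesis is true (a simplifying assumption)."""
    # Each test avoids a false positive with probability (1 - alpha);
    # all n_tests avoid it together with probability (1 - alpha) ** n_tests.
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    for n in (1, 10, 100, 10_000):
        print(f"{n:>6} tests -> {prob_at_least_one_false_positive(n):.4f}")
```

With a single test the chance stays at the 5% threshold, and it approaches certainty as the number of tests reaches the thousands typical of a sequencing experiment.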
- 02:50And so, for example,
- 02:52here you see a numeric example,
- 02:54So assume that you run
- 02:57a differential expression analysis
- 03:00on next generation sequencing data.
- 03:03You have,
- 03:04you have a collection of data on
- 03:0710,000 genes and you select 1,000 genes,
- 03:10so 10%, with a P value that is less than 0.05.
- 03:15Now,
- 03:16since you ran 10,000 tests and you
- 03:20accepted a 5% probability of making an error,
- 03:23that means that you expect, by chance,
- 03:26to have 5% of tests that
- 03:31are significant but not true.
- 03:37So we expect 500 out of the 10,000 genes to
- 03:42be picked up by chance.
- 03:45So 500
- 03:48is the number of genes we expect
- 03:50to be in our list of differentially
- 03:52expressed genes when they are not,
- 03:54and it's 50% of the 1,000 that we found.
- 03:59So if we calculate 500 out of 1,000,
- 04:04that means that we have 50% of genes
- 04:07that we expect to be false positives.
- 04:10So this ratio, 50%, is what we
- 04:13call the false discovery rate,
- 04:15and in this case, using just these
- 04:18values, it is 50%. So it's very high.
- 04:21It means that potentially one out of
- 04:23two genes can be a false positive.
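The arithmetic of this numeric example can be reproduced in a few lines. This is only a sketch of the back-of-the-envelope estimate described above; the variable names are illustrative, and it assumes, as the example does, that all 10,000 tests count toward the 5% error.

```python
n_tests = 10_000      # genes tested
alpha = 0.05          # P value threshold
n_selected = 1_000    # genes with P value below the threshold

# Tests expected to pass the threshold purely by chance.
expected_false_positives = alpha * n_tests

# Share of the selected genes that we expect to be false positives.
estimated_fdr = expected_false_positives / n_selected

print(f"expected false positives: {expected_false_positives:.0f}")
print(f"estimated false discovery rate: {estimated_fdr:.0%}")
```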
- 04:26So multiple test correction methods are
- 04:28ways to modify the original P values,
- 04:32particularly to increase the original P
- 04:35values so that this probability of having
- 04:38false positives is ultimately
- 04:40reduced.
- 04:45Yeah, I just want to make a quick
- 04:47addition, as this is a very important
- 04:50concept, and the first time I
- 04:52encountered the words "null hypothesis"
- 04:55I always got confused initially also.
- 04:57What the heck is the null hypothesis? The
- 04:59null hypothesis we can consider as, for
- 05:01example, if you're comparing two groups
- 05:03of samples, the null hypothesis is basically
- 05:06that the one gene, the gene
- 05:08you're looking at, has no difference
- 05:09between these two groups, for example.
- 05:11So this will be the null hypothesis, and the
- 05:135% P value, 0.05,
- 05:18means you have a 5% chance
- 05:21to make a wrong call, basically.
- 05:24So that's essentially how
- 05:25I interpret it myself.
- 05:28Yeah, exactly. So
- 05:29the false positive is a wrong call.
- 05:31So basically yes, and when you do
- 05:35pathway enrichment, the null hypothesis
- 05:37is that the pathway is not enriched.
- 05:40So those are the two examples that I'm doing.
- 05:43Um, so all the multiple test
- 05:46correction methods increase the
- 05:48original P values with the aim
- 05:50of reducing these false positives.
- 05:53What they do not do is
- 05:55swap the order of the P values.
- 05:57So you can imagine you have your
- 05:59list of genes and you rank your genes
- 06:01according to the original P values.
- 06:03When you perform the transformation
- 06:05or the correction,
- 06:06you don't have changes in the
- 06:09rank of the genes.
- 06:11So the simplest correction is
- 06:14the Bonferroni correction method.
- 06:16It's simple because you just take
- 06:18the original P value and you
- 06:20multiply the original P value by
- 06:22the number of tests that you perform.
- 06:24So in our case, for example,
- 06:26if you run tests for 10,000 genes,
- 06:28that means you take your original
- 06:30P values and you multiply each
- 06:32of these P values by 10,000.
- 06:34So this is the formula,
- 06:36there, just...
- 06:37I have a question. I'm
- 06:38understanding what you're saying,
- 06:39but there's "adjusted P minus value".
- 06:43I know it's a P value.
- 06:44Is it an equation? No, no.
- 06:45It's "P-value" with a very large space.
- 06:48A minus, yeah. Oh yeah, yeah.
- 06:53OK, it's "P-value". So
- 06:56I will correct this, so it's just "P-value".
- 06:59Yeah, it's the original P value
- 07:01multiplied by the number of tests.
- 07:04OK, thank you. And this belongs to a family
- 07:08of correction methods that control for
- 07:11the so-called family-wise error rate.
- 07:14So after the correction, when you
- 07:16take everything that is below 0.05,
- 07:19the interpretation of that is that you're
- 07:23allowing a 5% probability of having at least
- 07:26one false positive.
- 07:28So you're controlling for the
- 07:30probability of having in your final
- 07:32list at least one false positive.
- 07:34That's why it's a very
- 07:37conservative correction,
- 07:38so it's very stringent, because
- 07:39basically you are not allowing
- 07:42any false positives at all, almost.
- 07:44And so,
- 07:45especially when you have a
- 07:47large number of tests,
- 07:48since this is the number that you
- 07:50multiply your original P values by,
- 07:51it can be not very rewarding,
- 07:55meaning that after the correction,
- 07:56whatever the P values were, they are
- 07:59basically set to 1,
- 08:01and so no gene
- 08:03is selected after the correction.
- 08:04So that's why this is simple to explain,
- 08:06but it's rarely used.
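The Bonferroni correction as described here fits in a short function. The sketch below is a minimal illustration rather than the code behind the slides: the function name, the optional `n_tests` argument and the three P values are made up for the example, and adjusted values are capped at 1 because a probability cannot exceed 1.

```python
def bonferroni(p_values, n_tests=None):
    """Multiply each P value by the number of tests, capping the result at 1."""
    # By default the number of tests is the number of P values supplied.
    m = len(p_values) if n_tests is None else n_tests
    return [min(1.0, p * m) for p in p_values]


if __name__ == "__main__":
    # Hypothetical P values taken from an analysis of 10,000 genes.
    original = [0.000001, 0.0004, 0.03]
    adjusted = bonferroni(original, n_tests=10_000)
    selected = [p for p in adjusted if p < 0.05]
    print(adjusted)
    print(f"{len(selected)} of {len(original)} remain below 0.05")
```

In this made-up case two of the three P values, although below 0.05 before the correction, end up at 1 afterwards, which is the stringency described above.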
- 08:08The most common method that is used
- 08:11is the Benjamini-Hochberg correction.
- 08:13So this is the most popular
- 08:15multiple test correction method. It
- 08:17was introduced about 25 years ago.
- 08:19So this belongs to a family
- 08:21of methods that are designed to
- 08:23control the false discovery rate,
- 08:25so the proportion of false positives
- 08:30that we expect in our data.
- 08:32So if we select,
- 08:35after the correction,
- 08:370.05, it means that we allow 5% of our genes
- 08:42to be wrong calls or false positives.
- 08:45So it's a stepwise method,
- 08:47so it's slightly
- 08:49more complex
- 08:51than the Bonferroni,
- 08:53but the formula is quite straightforward.
- 08:55So it requires first that you sort
- 08:59all your original values in increasing order,
- 09:02so from the smallest to the biggest value,
- 09:06so that you can calculate the rank.
- 09:09So the smallest
- 09:11has rank one, and then two, three and so on.
- 09:14So the adjusted
- 09:16P value is basically the original P
- 09:19value multiplied by the number of tests
- 09:22divided by the rank of the variable.
- 09:25So this means that you don't
- 09:27multiply your original value by a
- 09:30fixed number as in the Bonferroni,
- 09:32but the amount
- 09:35of the multiplication depends on
- 09:37the rank of your original P value.
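Written side by side, the two scaling rules described so far look as follows. The notation is an assumption introduced here, not taken from the slides: $m$ is the number of tests, and $p_{(i)}$ is the original P value with rank $i$ after sorting in increasing order.

```latex
\[
  p^{\text{Bonferroni}}_{(i)} = p_{(i)} \cdot m ,
  \qquad
  p^{\text{BH}}_{(i)} = p_{(i)} \cdot \frac{m}{i}
\]
```

For rank $i = 1$ the two factors coincide, and for the largest P value ($i = m$) the Benjamini-Hochberg factor is 1, so that value is left unchanged.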
- 09:40And I have these examples to show the
- 09:43difference between the two approaches.
- 09:45So it's a simplified example.
- 09:48We run an analysis.
- 09:49Let's assume we run an analysis
- 09:52on 53 genes.
- 09:55So these are our genes. We're
- 09:57testing for differential expression
- 09:59of these genes in two conditions.
- 10:00So these are the original P
- 10:02values that we get,
- 10:04for example, from, let's say, a t-test
- 10:07or something that is more specific
- 10:09for next generation sequencing data.
- 10:11So it's
- 10:12any test of differential expression.
- 10:15So this is the original P value, and
- 10:18I ranked the genes by increasing
- 10:20values so that this gene, level
- 10:234, is the most significant.
- 10:25And so on. This, the last one,
- 10:28has no significance at all,
- 10:29because the P value is 1.
- 10:32So this is the Bonferroni formula.
- 10:33So since we have 53
- 10:36genes, we run 53 tests, and so every
- 10:39number has to be multiplied by 53.
- 10:42So after this formula,
- 10:44that's the result that we get.
- 10:47And so, if before the correction all these
- 10:50genes were below the 0.05 threshold,
- 10:53after the correction
- 10:55only the first gene is selected,
- 10:59because it's the only one that is below 0.05.
- 11:03Um, below here you see the same analysis
- 11:06but with the Benjamini-Hochberg correction.
- 11:10So the original P values are the same.
- 11:13It's a stepwise procedure because
- 11:16you start from the bottom.
- 11:19And so you multiply this one,
- 11:22that is, the value 1, by the number
- 11:24of tests divided by the rank.
- 11:26So this is multiplied by one,
- 11:28so it stays the same.
- 11:29And that's why it's 1
- 11:31also here. Then you multiply this
- 11:34value by the number of tests,
- 11:3653, divided by the rank.
- 11:38So this is slightly more than one,
- 11:40so you're increasing a little bit
- 11:44the value here. This is the
- 11:47result, and what I didn't tell you before
- 11:50is that it's not simply this formula,
- 11:52but once you get this result,
- 11:54you have to check whether this is
- 11:57higher than the
- 12:00corrected P value of the gene that precedes.
- 12:03In this case, it's lower,
- 12:04so we keep this.
- 12:05But if this was higher, then we
- 12:08would have kept the preceding value,
- 12:11and you see this here. So here
- 12:14I proceed, and so we multiply this value
- 12:17here by 53 / 3, and this is the result.
- 12:21Now this result here,
- 12:23the multiplication, would give you 0.04.
- 12:26This is higher than what you obtain here,
- 12:300.035, and so that's why,
- 12:33instead,
- 12:34the result will not be the
- 12:36exact result of this formula,
- 12:38but this gene will take the
- 12:41value of the gene that precedes.
- 12:44And so that's why the final
- 12:46adjusted P value will be 0.035,
- 12:48and that's why when you use this method,
- 12:52you know, if you look now at,
- 12:53I mean,
- 12:54you notice you can have a lot of
- 12:56adjusted values that are the same.
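The stepwise procedure walked through above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code behind the slides: the function name and the five P values are made up, and "the gene that precedes" is read as the gene handled just before in the bottom-up walk, that is, the one with the next larger original P value.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    # Positions of the P values sorted from smallest (rank 1) to largest (rank m).
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    previous = 1.0  # adjusted value of the gene handled just before
    # Start from the bottom of the sorted list (largest P value) and move up.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        raw = p_values[i] * m / rank      # the formula: P value * tests / rank
        previous = min(previous, raw)     # keep whichever of the two is lower
        adjusted[i] = previous
    return adjusted


if __name__ == "__main__":
    # Hypothetical P values, already sorted for readability.
    original = [0.001, 0.01, 0.012, 0.04, 0.9]
    for p, q in zip(original, benjamini_hochberg(original)):
        print(f"P = {p:<6} BH adjusted = {q:.3f}")
```

In this made-up list the formula alone would give the rank 2 gene 0.025, which is higher than the 0.02 already assigned to the rank 3 gene, so it takes 0.02 instead. That is how repeated adjusted values arise.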
- 13:00Can I ask a question? This is great.
- 13:02I so appreciate your clarifying
- 13:05everything. Just on the bottom,
- 13:07where it says "BH adjusted P value"
- 13:10and then in parentheses "FDR,
- 13:12Q value": can you clarify all those
- 13:16different things? FDR,
- 13:17I know it's false discovery rate, and Q,
- 13:19but this 0.027,
- 13:22could it be referred to as the P value,
- 13:25the FDR and the Q?
- 13:27So, uh, yeah, this is a
- 13:29little bit of terminology.
- 13:30So this notation here tells you how
- 13:33the P value has been adjusted, so
- 13:35it has been adjusted with
- 13:38the Benjamini-Hochberg correction.
- 13:40So since this method belongs to
- 13:42a family of methods that are
- 13:45called false discovery rate methods,
- 13:47you can interpret the
- 13:50result also as a false discovery rate.
- 13:53So that's why sometimes you will not find
- 13:57"BH adjusted P value"
- 13:58but "false discovery rate", and
- 14:00also any adjusted P value,
- 14:02I think, can also be called a Q value.
- 14:05Got it, thank you.
- 14:06So that means that FDR can be used also
- 14:09with other correction methods
- 14:11that belong to the same family
- 14:13but are not Benjamini...
- 14:16Yes, Benjamini-Hochberg corrected.
- 14:19So usually in a publication you use
- 14:22FDR, for example, and then you specify
- 14:25in the methods that you used the
- 14:27Benjamini-Hochberg in order to
- 14:30calculate the FDR.
- 14:32But sometimes it's left ambiguous. Most
- 14:34of the time it will be the Benjamini-Hochberg,
- 14:37but in any case. Perfect, thank you.
- 14:41And, uh, yeah.
- 14:42And finally the first gene: as you see,
- 14:45only the first gene has the same
- 14:46correction as the Bonferroni,
- 14:48because this is the only case where
- 14:50this multiplication, since the
- 14:52rank is one, corresponds exactly
- 14:54to the Bonferroni formula.
- 14:55Unless, so,
- 14:56unless the value of this is higher
- 14:58than the value of the second gene.
- 15:01Because remember,
- 15:02in this case you
- 15:04take the minimum of the two values,
- 15:06the formula or the corrected
- 15:09value of the gene that precedes.
- 15:12And as you see, in this case,
- 15:14after you apply the Benjamini-Hochberg,
- 15:15after the correction,
- 15:17four of the genes are selected because
- 15:21the adjusted P value is below 0.05.
- 15:24So this is an example showing also
- 15:26that the Bonferroni is
- 15:28much more stringent than the Benjamini-Hochberg.
- 15:31Because here you accept
- 15:335% of false positives.
- 15:35Here you accept a 5% probability
- 15:38of having one false positive,
- 15:39and that's the difference
- 15:42in the interpretation.
- 15:44Now,
- 15:44just one thing I'd mention:
- 15:46the nomenclature is truly a big
- 15:48issue in the scientific literature,
- 15:50because different people use different
- 15:52ways to refer to these things.
- 15:54For example, in some papers you
- 15:56will see the original P value,
- 15:57which is what Thomas listed
- 15:59on the third column
- 16:00here as "P value".
- 16:01Some people refer to
- 16:03this as a nominal P value,
- 16:04and some people just refer to it directly as
- 16:07P value. And on the adjusted P value,
- 16:10some people refer to it as FDR.
- 16:12Some people refer to it as a Q value.
- 16:14Some people refer to it, like
- 16:16Thomas said, as a BH adjusted P value.
- 16:19Some people even will just tell
- 16:20you that it's an FDR adjusted P value.
- 16:23So there are many different
- 16:25nomenclatures for basically
- 16:27the same things, and different
- 16:29authors use different ways
- 16:30to refer to those things.
- 16:32Yeah, yeah, there is no...
- 16:34Yeah, I think, yeah, there
- 16:35is a lot of redundancy,
- 16:36let's say, now in terminology,
- 16:39and no specific rules.
- 16:42That depends on the reviewers, maybe.
- 16:45I see. OK, so this was, like,
- 16:49uh, an introduction.