
Analysis and Interpretation of single cells sequencing data – part 4 Multiple Test Correction

August 25, 2021
ID
6876

Transcript

  • 00:03OK, so today's talk will be mainly
  • 00:06about the analysis of single
  • 00:08cell RNA-seq. So we had,
  • 00:10last time, like the last three
  • 00:13times, we dealt with the analysis
  • 00:15of bulk RNA-seq, classic methods and
  • 00:18pathway enrichment analysis methods.
  • 00:20I have some slides because last
  • 00:23time I remember there was there were
  • 00:25questions more specific about the
  • 00:27multiple test correction and so I
  • 00:30have these slides at least to, like,
  • 00:34Try to explain two of the methods
  • 00:37that are used in order to do that.
  • 00:40So here.
  • 00:41So we start with the assumption that
  • 00:44we are bound in our statistical
  • 00:47analysis to that P value threshold of 0.05.
  • 00:50So whenever we run a test on
  • 00:53whatever we want to test,
  • 00:56it seems that the most important
  • 01:00thing is that you have a P value of
  • 01:04less than 0.05.
  • 01:08So a P value of 0.05 means that
  • 01:11the probability of rejecting
  • 01:13your null hypothesis when it's true is 5%.
  • 01:17So for example,
  • 01:19when we are comparing gene expression,
  • 01:21we're performing a test of
  • 01:24differential expression of 1
  • 01:25gene between condition A and B.
  • 01:27It means that the null
  • 01:30hypothesis is that the gene is
  • 01:33not differentially expressed.
  • 01:35So if we have a P value of 0.05,
  • 01:37it means that we can reject the null
  • 01:40hypothesis, so that the gene is not changing.
  • 01:43Yeah,
  • 01:44and the probability of that
  • 01:46null hypothesis to be true is only 5%,
  • 01:48so that's why the lower the P value,
  • 01:50the more confident we are in
  • 01:54rejecting the null hypothesis.
  • 01:56One of the limitations of this
  • 01:57is that the P value doesn't say
  • 02:00anything about our hypothesis,
  • 02:01so that the gene is really
  • 02:04differentially expressed.
  • 02:06So when we run analysis in a high-throughput way,
  • 02:10we usually have multiple comparison.
  • 02:12So it means that we run a test of
  • 02:15differential expression
  • 02:16for each gene, and we
  • 02:19usually have 10 to 20,000 genes.
  • 02:21For example, in our NGS experiments,
  • 02:24or when we do pathway enrichment analysis,
  • 02:27we're testing the enrichment of
  • 02:29hundreds or thousands of pathways.
  • 02:31That means that we run a lot of tests
  • 02:34and for these reasons the probability,
  • 02:37that is 5%, of having false positives,
  • 02:40so of saying that the gene is
  • 02:43differentially expressed
  • 02:44when in reality it is not, rises with
  • 02:47the number of tests that we run.
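As a rough numeric illustration of why false positives become more likely as the number of tests grows, the following Python sketch computes the chance of seeing at least one false positive. It assumes the tests are independent and that every null hypothesis is true, a simplification that is not stated in the talk; the function name is made up for this example.

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance of one or more false positives among n_tests tests.

    Assumes independent tests in which every null hypothesis is true,
    each tested at significance level alpha.
    """
    # Probability that no test is a false positive is (1 - alpha) ** n_tests.
    return 1.0 - (1.0 - alpha) ** n_tests


# The chance grows quickly with the number of tests.
for n in (1, 10, 100):
    print(n, round(prob_at_least_one_false_positive(n), 3))
```

With a single test the chance is the familiar 5%; with 100 such tests it is already above 99%.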
  • 02:50And so, for example,
  • 02:52here you see a numeric example,
  • 02:54so if you assume that you run
  • 02:57a differential expression analysis
  • 03:00on next generation sequencing data,
  • 03:03you have a,
  • 03:04you have a collection of data on
  • 03:0710,000 genes and you select 1000 genes,
  • 03:10so 10% with a P value that is less than 0.05.
  • 03:15Now,
  • 03:16since you ran 10,000 tests and you
  • 03:20accepted the 5% probability to make an error,
  • 03:23that means that you expect by chance
  • 03:26to have 5% of tests that
  • 03:31are significant but not true.
  • 03:37So we expect 500 out of the 10,000 genes to
  • 03:42be picked up by chance.
  • 03:45If we compare, so 500 is what we expect,
  • 03:48it is the number of genes we expect
  • 03:50to be in our list of differentially
  • 03:52expressed when they are not,
  • 03:54and it's 50% of the 1000 that we find.
  • 03:59So if we calculate 500 out of 1000,
  • 04:04that means that we have 50% of genes
  • 04:07that we expect to be false positives.
  • 04:10So this ratio, 50%, is what we
  • 04:13call the false discovery rate,
  • 04:15and in this case, using just these
  • 04:18values, it is 50%. So it's very high.
  • 04:21It means that potentially one out of
  • 04:23two genes can be a false positive.
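The arithmetic in this example can be written out as a small Python sketch. It follows the speaker's simplification that 5% of all tests come up significant by chance, which treats every gene as if it were not differentially expressed; the function name is hypothetical.

```python
def naive_expected_fdr(n_tests, n_selected, alpha=0.05):
    """Expected share of false positives among the selected genes.

    Simplification: alpha * n_tests genes are expected to pass the
    threshold by chance alone.
    """
    expected_false_positives = alpha * n_tests
    return expected_false_positives / n_selected


# 10,000 genes tested, 1,000 selected at P < 0.05:
# 500 expected false positives out of 1,000 selected genes.
print(naive_expected_fdr(10_000, 1_000))
```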
  • 04:26So multiple test correction methods are
  • 04:28ways to modify the original P values,
  • 04:32particularly to increase the original P
  • 04:35values so that this probability of having
  • 04:38false positives is ultimately less.
  • 04:40It is reduced.
  • 04:45Yeah, I just want to make a quick
  • 04:47addition, as this is a very important
  • 04:50concept, and the first time I
  • 04:52encountered the words null hypothesis,
  • 04:55I always got confused initially also.
  • 04:57What the heck is the null hypothesis? So the
  • 04:59null hypothesis we can consider as, for
  • 05:01example, if you're comparing two groups
  • 05:03of samples, the null hypothesis is basically
  • 05:06that the one gene, the gene
  • 05:08you're looking at, has no difference
  • 05:09between these two groups, for example.
  • 05:11So this will be the null hypothesis, and the
  • 05:135% P value, 0.05,
  • 05:18means you have a 5% chance
  • 05:21to make a wrong call basically,
  • 05:24so that's eventually,
  • 05:25I interpreted that way myself.
  • 05:28Yeah, exactly. So that's, the
  • 05:29false positive is a wrong call.
  • 05:31So basically yes, and when you do
  • 05:35pathway enrichment, the null hypothesis
  • 05:37is that the pathway is not enriched.
  • 05:40So those are the two examples that I'm doing.
  • 05:43Um, so all the multiple test
  • 05:46correction methods increase the
  • 05:48original P values with the aim
  • 05:50of reducing these false positives.
  • 05:53What they do not do is that they
  • 05:55do not swap the order of P values,
  • 05:57so you can imagine if you have your
  • 05:59list of genes and you rank your genes
  • 06:01according to the original P values.
  • 06:03When you perform the transformation
  • 06:05or the correction,
  • 06:06you don't have changes in the
  • 06:09rank of the genes.
  • 06:11So the simplest correction is
  • 06:14the Bonferroni correction method.
  • 06:16It's simple because you just take
  • 06:18the original P value and you
  • 06:20multiply the original P value by
  • 06:22the number of tests that you perform.
  • 06:24So in our case, for example,
  • 06:26if you run tests for 10,000 genes,
  • 06:28that means you take your original
  • 06:30P values and you multiply each
  • 06:32of these P values by 10,000.
  • 06:34So this is the formula.
  • 06:36I just
  • 06:37have a question. I'm
  • 06:38understanding what you're saying,
  • 06:39but there's adjusted P minus value?
  • 06:43I know it's a P value,
  • 06:44it's an equation, no, no.
  • 06:45It's a P value with a very large space,
  • 06:48minus, yeah. Oh yeah, yeah.
  • 06:53OK, it's P value, so
  • 06:56I will correct this, so it's just P value.
  • 06:59Yeah, it's the original P value
  • 07:01multiplied by the number of tests.
  • 07:04OK, thank you. And this belongs to a family
  • 07:08of correction methods that control for
  • 07:11the so-called familywise error rate.
  • 07:14So after the correction, when you
  • 07:16take everything that is below 0.05,
  • 07:19the interpretation of that is that you're
  • 07:23allowing a 5% probability to have at least
  • 07:26one false positive.
  • 07:28So you're controlling for the
  • 07:30probability of having in your final
  • 07:32list at least one false positive.
  • 07:34That's why it's a very
  • 07:37conservative correction,
  • 07:38so it's very stringent because
  • 07:39basically you are not allowing
  • 07:42any false positive at all, almost.
  • 07:44And so,
  • 07:45especially when you have a
  • 07:47large number of tests,
  • 07:48since this is the number that you
  • 07:50multiply your original P values by,
  • 07:51it can be very unrewarding,
  • 07:55meaning that after the correction,
  • 07:56whatever the P values, they are
  • 07:59basically reduced to 1.
  • 08:01And so that no gene after
  • 08:03the correction is selected.
  • 08:04So that's why this is simple to explain,
  • 08:06but it's rarely used.
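A minimal Python sketch of the Bonferroni adjustment as described above: each original P value is multiplied by the number of tests. Capping the result at 1 is an assumption added here (a probability cannot exceed 1, and it matches the remark that values end up at 1). The function name and the P values are hypothetical and are not the ones shown on the slides.

```python
def bonferroni_adjust(p_values):
    """Multiply every P value by the number of tests, capped at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# Hypothetical P values from four tests.
original = [0.0005, 0.01, 0.03, 0.2]
adjusted = bonferroni_adjust(original)

# All but the last original value are below 0.05; fewer survive the correction.
selected = [p for p in adjusted if p < 0.05]
print(adjusted, len(selected))
```

Because every value is multiplied by the same number, the ranking of the genes does not change, and the penalty grows linearly with the number of tests.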
  • 08:08The most common method that is used
  • 08:11is the Benjamini Hochberg correction,
  • 08:13so this is the most popular
  • 08:15multiple test correction method
  • 08:17was introduced 25 years ago,
  • 08:19so this belongs to a family
  • 08:21of methods that are designed to
  • 08:23control the false discovery rate,
  • 08:25so the proportion of false positives that we,
  • 08:30one thing that we expect in our data.
  • 08:32So if we select,
  • 08:35after the correction,
  • 08:370.05, it means that we allow 5% of our genes
  • 08:42to be wrong calls or false positives.
  • 08:45So it's a stepwise method,
  • 08:47so it's slightly
  • 08:49more complex than the,
  • 08:51than the Bonferroni,
  • 08:53but the formula is quite straightforward.
  • 08:55So it requires first that you sort
  • 08:59all your original P values in increasing order,
  • 09:02So from the smallest to the biggest value,
  • 09:06so that you can calculate the rank.
  • 09:09So the smallest
  • 09:11has rank one, and then two, three and so on.
  • 09:14So the adjusted
  • 09:16P value is basically the original P
  • 09:19value multiplied by the number of tests
  • 09:22divided by the rank of the variable.
  • 09:25So this means that you don't
  • 09:27multiply your original value by a
  • 09:30fixed number as in the Bonferroni,
  • 09:32but the multiplication the amount
  • 09:35of the multiplication depends on
  • 09:37the rank of your original P value.
  • 09:40And I have these examples to show the
  • 09:43difference between the two approaches,
  • 09:45so it's a simplified example.
  • 09:48We run an analysis.
  • 09:49Let's assume we run an analysis
  • 09:52on 53 genes.
  • 09:55So these are our genes we're
  • 09:57testing for differential expression
  • 09:59of these genes in two conditions,
  • 10:00so these are the original P
  • 10:02values that we get.
  • 10:04For example from, let's say, a t-test
  • 10:07or something that is more specific
  • 10:09for next generation sequencing data.
  • 10:11So it's
  • 10:12any test of differential expression.
  • 10:15So this is the original P value and
  • 10:18I ranked the genes in increasing
  • 10:20values so that this gene, labeled
  • 10:234, is the most significant.
  • 10:25And so on. This, the last one,
  • 10:28has no significance at all,
  • 10:29because the P value is 1.
  • 10:32So this is the Bonferroni formula.
  • 10:33So since we have 53
  • 10:36genes, we run 53 tests and so every
  • 10:39number has to be multiplied by 53.
  • 10:42So after this formula,
  • 10:44that's the result that we get.
  • 10:47And so if before the correction all these
  • 10:50genes were below the 0.05 threshold,
  • 10:53after the correction
  • 10:55only the first gene is selected,
  • 10:59Because it's the only one that is below 0.05.
  • 11:03Um, below here you see the same analysis
  • 11:06but with the Benjamini-Hochberg correction.
  • 11:10So the original P values are the same.
  • 11:13It's a stepwise procedure because
  • 11:16you start from the bottom.
  • 11:19And so you multiply
  • 11:22this value, one, by the number
  • 11:24of tests divided by the rank.
  • 11:26So this is multiplied by one,
  • 11:28so it stays the same.
  • 11:29And that's why it's one
  • 11:31also here. Then you multiply this
  • 11:34value by the number of tests,
  • 11:3653, divided by the rank.
  • 11:38So this is slightly more than one,
  • 11:40so you're a little bit increasing the,
  • 11:44the value here. This is the
  • 11:47result, and what I didn't tell you before
  • 11:50is that it's not simply this formula,
  • 11:52but once you get these results,
  • 11:54you have to check whether this is
  • 11:57higher than the value of the corrected
  • 12:00P value of the gene that precedes.
  • 12:03In this case, it's lower,
  • 12:04so we keep this.
  • 12:05But if this was higher, then we
  • 12:08would have kept this value,
  • 12:11and you see this here. So here
  • 12:14I proceed and so we multiply this value
  • 12:17here by 53 / 3 and this is the result.
  • 12:21Now this result here,
  • 12:23the multiplication would give you 0.04.
  • 12:26This is higher than what you obtain here,
  • 12:300.035, and so that's why, instead,
  • 12:33instead of,
  • 12:34the result will not be the
  • 12:36exact result of this formula,
  • 12:38but this gene will take the
  • 12:41value of the gene that precedes.
  • 12:44And so that's why the final
  • 12:46adjusted P value will be 0.035,
  • 12:48and that's why when you use this method,
  • 12:52you know, if you look now at,
  • 12:53I mean,
  • 12:54you notice you can have a lot of
  • 12:56adjusted values that are the same.
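The stepwise procedure described above can be sketched in Python as follows: sort the P values, start from the largest one, apply P × (number of tests) / rank, and keep the smaller of that result and the adjusted value of the gene handled just before (the one with the next larger P value). The function name and the five P values are hypothetical and are not the 53 values from the slides.

```python
def benjamini_hochberg_adjust(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    n_tests = len(p_values)
    # Indices of the P values sorted from smallest (rank 1) to largest.
    order = sorted(range(n_tests), key=lambda i: p_values[i])

    adjusted = [0.0] * n_tests
    running_min = 1.0  # adjusted values never exceed 1
    # Walk from the bottom of the sorted table (largest P value) upward.
    for rank in range(n_tests, 0, -1):
        idx = order[rank - 1]
        raw = p_values[idx] * n_tests / rank
        # Keep the smaller of the formula result and the previous adjusted value.
        running_min = min(running_min, raw)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical P values from five tests.
original = [0.001, 0.008, 0.039, 0.041, 0.6]
adjusted = benjamini_hochberg_adjust(original)
print(adjusted)
```

In this made-up example the formula gives the third gene 0.039 × 5 / 3 = 0.065, which is higher than the 0.05125 obtained for the fourth gene, so the third gene takes 0.05125 as well. This is how tied adjusted values arise.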
  • 13:00Can I ask a question? This is great.
  • 13:02I so appreciate your clarifying
  • 13:05everything. Just on the bottom,
  • 13:07where it says BH adjusted P value
  • 13:10and then in parentheses FDR,
  • 13:12Q value, can you clarify all those
  • 13:16different things? That FDR,
  • 13:17I know it's false discovery rate, and Q,
  • 13:19but is this 0.027,
  • 13:22could it be referred to as the P value,
  • 13:25the FDR and the Q?
  • 13:27So, uh, yeah, this is a
  • 13:29little bit of terminology.
  • 13:30So this notation here tells you how
  • 13:33the P value has been adjusted, so
  • 13:35it has been adjusted with the
  • 13:38Benjamini-Hochberg correction.
  • 13:40So since this method belongs to
  • 13:42a family of methods that are
  • 13:45called false discovery rate methods,
  • 13:47you can interpret the
  • 13:50result also as a false discovery rate.
  • 13:53So that's why sometimes you will not find
  • 13:57BH adjusted P value,
  • 13:58but false discovery rate, and
  • 14:00also any adjusted P value,
  • 14:02I think, can be also called the Q value.
  • 14:05Got it, thank you.
  • 14:06So that means that FDR can be used also
  • 14:09with other correction methods
  • 14:11that belong to the same family,
  • 14:13but that are not Benjamini,
  • 14:16yes, Benjamini-Hochberg corrected.
  • 14:19So usually in publications you use
  • 14:22FDR, for example, and then you specify
  • 14:25in the methods that you used the
  • 14:27Benjamini-Hochberg in order to
  • 14:30calculate the FDR.
  • 14:32But sometimes it's left ambiguous. Most
  • 14:34of the time it will be the Benjamini-Hochberg,
  • 14:37but in any case. Perfect, thank you.
  • 14:41And, uh, yeah.
  • 14:42And finally the first gene, as you see,
  • 14:45only the first gene has the same
  • 14:46correction as the Bonferroni,
  • 14:48because this is the only case where
  • 14:50this multiplication, since the
  • 14:52rank is one, corresponds exactly
  • 14:54to the Bonferroni formula.
  • 14:55Unless, so,
  • 14:56unless the value of this is higher
  • 14:58than the value of the second gene.
  • 15:01Because remember,
  • 15:02in this case you take the, you
  • 15:04take the minimum of the two values,
  • 15:06the formula or the corrected
  • 15:09value of the gene that precedes.
  • 15:12And as you see, in this case,
  • 15:14after you apply the Benjamini-Hochberg,
  • 15:15after the correction,
  • 15:17four of the genes are selected because
  • 15:21the adjusted P value is below 0.05.
  • 15:24So this is an example showing also
  • 15:26that the Bonferroni is much
  • 15:28more stringent than the Benjamini-Hochberg.
  • 15:31Because here you accept
  • 15:335% of false positives.
  • 15:35Here you accept a 5% probability
  • 15:38to have one false positive,
  • 15:39and that's the difference
  • 15:42in the interpretation.
  • 15:44Now,
  • 15:44just one thing that I want to mention:
  • 15:46the nomenclature is truly a big
  • 15:48issue in the scientific literature
  • 15:50because different people use different
  • 15:52ways to refer to these things.
  • 15:54For example, in some papers you
  • 15:56will see the original P value,
  • 15:57which is what Thomas listed
  • 15:59on the third column
  • 16:00here as P value.
  • 16:01Some people refer to
  • 16:03this as a nominal P value,
  • 16:04and some people just refer to it directly as
  • 16:07P value. And on the adjusted P value,
  • 16:10some people refer to it as FDR.
  • 16:12Some people refer to it as a Q value.
  • 16:14Some people refer to it, like
  • 16:16Thomas said, as BH adjusted P value.
  • 16:19Some people even will just tell
  • 16:20you that it's FDR adjusted P value.
  • 16:23So there are many different
  • 16:25nomenclatures for basically
  • 16:27the same things, and different
  • 16:29authors use different ways
  • 16:30to refer to those things.
  • 16:32Yeah, yeah there is no.
  • 16:34Yeah I think yeah there
  • 16:35is a lot of redundancy.
  • 16:36Let's say now in terminology.
  • 16:39And no specific rules.
  • 16:42That depends on the reviewers, mainly.
  • 16:45I see. OK, so this was, like,
  • 16:49uh, an introduction.