
Analysis and Interpretation of single-cell sequencing data – part 2

August 25, 2021
  • 00:00Yeah.
  • 00:03OK, so today is the second part of our
  • 00:06journey through the analysis of
  • 00:09single-cell RNA-seq data processing.
  • 00:11Last time we started by defining how
  • 00:14single-cell RNA sequencing works and the
  • 00:18differences between different protocols.
  • 00:20For example, coverage of genes,
  • 00:22how they isolate cells and so on.
  • 00:25Today we deal more with the
  • 00:28actual analysis of the data.
  • 00:31So last time we arrived
  • 00:33at the point where we saw
  • 00:36these starting steps.
  • 00:37We saw that, from the molecular point of view,
  • 00:41the strategy is to link the original
  • 00:44RNA molecule with an oligonucleotide
  • 00:46called the cell barcode, which allows us to
  • 00:50identify the cell of origin of the RNA,
  • 00:52and then another important part is the UMI,
  • 00:56the unique molecular identifier, which is a
  • 00:58random nucleotide sequence that allows us to
  • 01:01correct for amplification biases,
  • 01:03so to keep only those
  • 01:06reads that belong to
  • 01:08different molecules in our cells
  • 01:11and were not duplicated during the PCR.
  • 01:15So after these steps and after the
  • 01:18mapping, which we also covered last time...
  • 01:21sorry, a question? Yes.
  • 01:24How can you have the same UMI
  • 01:27in two different RNAs? Oh, I see.
  • 01:30If you have the same
  • 01:32UMI, you collapse the reads.
  • 01:34So if the read is the same,
  • 01:36the UMI is the same and the
  • 01:38cell barcode is the same,
  • 01:40you collapse the read.
  • 01:41And, you know... I see, I see,
  • 01:43the one I'm looking at, I see. So you could
  • 01:46have the same cell barcode, the same
  • 01:49UMI, but a different sequence because
  • 01:51you're in a different part of the same RNA.
  • 01:55Well, in theory that depends on the
  • 01:57protocol, because some of them are
  • 01:58only 3'-end, and so this is...
  • 02:00I'm looking at numbers five and six there.
  • 02:06Five and six, the reads you mean? Yeah?
  • 02:09Well, uhm, yes, in theory.
  • 02:11But in theory, yes.
  • 02:12So these would be a different
  • 02:15RNA, could be a different gene,
  • 02:17but randomly they have the same
  • 02:19UMI, yeah. So in theory it can happen.
  • 02:24It depends on the length of the UMI,
  • 02:27because they are randomly generated.
  • 02:28For example,
  • 02:29if they are 12 nucleotides long,
  • 02:32the probability to have two that are
  • 02:35identical is one over 4 to the 12th,
  • 02:38so the longer they are, the lower
  • 02:40is the probability to have two
  • 02:43UMIs with the same sequence.
  • 02:46OK, yeah.
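A minimal Python sketch of the collapsing rule and of the collision arithmetic just discussed; the tuple key and the toy reads are illustrative, not taken from any specific pipeline:

```python
# Reads are collapsed when they share the same cell barcode, UMI and mapped
# position: such duplicates are assumed to come from PCR amplification of a
# single original molecule. All values below are made up for illustration.
reads = [
    ("ACGTACGT", "TTGCAATGCGGA", "chr1:1000"),  # (cell barcode, UMI, position)
    ("ACGTACGT", "TTGCAATGCGGA", "chr1:1000"),  # PCR duplicate -> collapsed
    ("ACGTACGT", "GGTACCTAGGTA", "chr1:1000"),  # same position, different UMI
]
unique_molecules = set(reads)
print(f"{len(reads)} reads -> {len(unique_molecules)} molecules")

# Probability that two randomly generated UMIs of length L are identical is
# 1 / 4**L, so longer UMIs make accidental collisions rarer.
for L in (8, 10, 12):
    print(f"UMI length {L}: per-pair collision probability = {1 / 4**L:.1e}")
```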
  • 02:49Uhm, OK, so UMIs are
  • 02:51a strategy used to reduce amplification
  • 02:53biases, in order to correct for that.
  • 02:56And in single cell they are important
  • 02:58because of the
  • 03:00low amount of material we start with,
  • 03:02that is, the RNA content
  • 03:04of a single cell, and also the elevated
  • 03:07number of amplification cycles that are
  • 03:09necessary in order to amplify the signal.
  • 03:12So after the mapping of the reads we
  • 03:14arrive at this gene expression matrix,
  • 03:17where each column represents
  • 03:19one of the cells of our sample and each
  • 03:22row is a gene, and already last time we
  • 03:26saw the fact that if you compare a bulk
  • 03:30versus a single cell matrix, the single cell
  • 03:33one has lower numbers,
  • 03:35lower counts, and that means that we
  • 03:38have a higher potential contribution of
  • 03:40noise, and also we have several zeros:
  • 03:44something like 60 to 80% of all the values
  • 03:48will be 0.
  • 03:49And the problem is that many of these
  • 03:52zeros are not biologically true,
  • 03:54so it doesn't necessarily mean that the gene
  • 03:57is not expressed in the cell;
  • 03:59they are technical, because the transcripts
  • 04:01were not detected during
  • 04:03the RNA capture.
  • 04:05So that's the main difference in terms
  • 04:09of numbers with respect to bulk RNA-seq.
  • 04:12So the first steps, that we
  • 04:15call the preprocessing steps:
  • 04:17after arriving at the digital
  • 04:19count matrix,
  • 04:21we basically try to remove cells
  • 04:24that are potentially of low
  • 04:27quality and also genes that are
  • 04:30potentially irrelevant for our analysis.
  • 04:33So the first step in the preprocessing
  • 04:36is that we want to remove
  • 04:39empty droplets or dying cells.
  • 04:43It could happen that during the
  • 04:45preparation of our libraries,
  • 04:47some droplets
  • 04:48are empty or filled
  • 04:51with cells that are dying.
  • 04:55So usually the way to spot these is
  • 04:58the quality of the data, so what
  • 05:01we can do is count the
  • 05:04number of reads or the number of
  • 05:07UMIs that we detect in each cell.
  • 05:10That's the sum of the number of
  • 05:12unique reads that are aligned for
  • 05:14each cell, and we can rank the
  • 05:17cells from the one with the most
  • 05:19UMIs to the one with the fewest UMIs,
  • 05:22and we have this sort of distribution.
  • 05:25And then we can decide to remove the
  • 05:28bottom cells that you see here in red,
  • 05:31the cells where the UMI
  • 05:34number is very low.
  • 05:37So this is one easy strategy to
  • 05:39remove cells where we don't
  • 05:40have coverage of many genes.
  • 05:42We don't have a lot of reads,
  • 05:44and likely it's
  • 05:47a sign of something wrong
  • 05:49during the preparation;
  • 05:50for example,
  • 05:51the droplet was
  • 05:53empty or the cell was dying.
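A minimal sketch of this ranking-and-filtering step on simulated counts; the numbers and the fixed cutoff are illustrative, since in practice the threshold is read off the knee of the ranked UMI curve:

```python
import numpy as np

# Toy droplets x genes UMI count matrix: 300 real cells with high counts
# plus 200 near-empty droplets with very low counts.
rng = np.random.default_rng(0)
real = rng.poisson(1.0, size=(300, 2000))
empty = rng.poisson(0.05, size=(200, 2000))
counts = np.vstack([real, empty])

umis_per_cell = counts.sum(axis=1)           # total UMIs per droplet
ranked = np.sort(umis_per_cell)[::-1]        # the distribution one would plot

threshold = 500                              # assumed cutoff near the knee
keep = umis_per_cell > threshold
print(f"kept {keep.sum()} of {len(keep)} droplets")
```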
  • 05:56Another way to capture dying cells,
  • 05:58to remove dying cells, is that usually
  • 06:01dying cells are associated with a
  • 06:04high number of reads that map to
  • 06:07mitochondrial genes, so dying
  • 06:09cells have extensive
  • 06:11mitochondrial contamination.
  • 06:12And so one can quantify the number of
  • 06:15reads that map to mitochondrial genes.
  • 06:17I think there are 37 genes in
  • 06:22human cells that are associated
  • 06:24with the mitochondrial chromosome,
  • 06:25and if the number of
  • 06:29mitochondrial reads is less than 5%,
  • 06:31then you keep the cell.
  • 06:33If it's higher than that, 10 or 20%,
  • 06:36then you remove the entire cell, because
  • 06:38there is a high probability that
  • 06:41this high contamination is due
  • 06:44to the fact that the cell was dying.
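A sketch of that mitochondrial filter, assuming human gene names where the "MT-" prefix marks mitochondrial genes; the counts are simulated and the 5% cutoff follows the talk:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
genes = ["MT-CO1", "MT-ND1", "ACTB", "GAPDH"]
counts = pd.DataFrame(
    np.column_stack([rng.poisson(1, 6), rng.poisson(1, 6),     # mitochondrial
                     rng.poisson(50, 6), rng.poisson(50, 6)]),  # nuclear
    columns=genes)

mito = counts.columns.str.startswith("MT-")
mito_frac = counts.loc[:, mito].sum(axis=1) / counts.sum(axis=1)
kept = counts[mito_frac < 0.05]       # keep cells under 5% mitochondrial reads
print(f"kept {len(kept)} of {len(counts)} cells")
```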
  • 06:49Uhm then, on the other side,
  • 06:51we want also to remove doublets.
  • 06:53So the whole technique is based on the
  • 06:56fact that we isolate single cells,
  • 06:58but sometimes this doesn't
  • 07:01happen properly,
  • 07:02so it means that it can happen that
  • 07:05two cells share the same barcode, or
  • 07:07two cells were not physically separated,
  • 07:10so they were included in the same droplet,
  • 07:13for example, if we are using the droplet
  • 07:16approach. And so we want to identify
  • 07:19possible doublets and remove those.
  • 07:21So we define a doublet as
  • 07:26a droplet, or as an isolation,
  • 07:28not of one single cell
  • 07:31but of two or more cells.
  • 07:33The most common event is that you have
  • 07:36two cells included in the same droplet.
  • 07:39So when you develop these
  • 07:41single cell techniques,
  • 07:43there are experimental ways to
  • 07:45evaluate the probability of having
  • 07:47doublets, and the approach that is used
  • 07:50there is species mixing,
  • 07:51so you combine, for example, a population
  • 07:54of human cells and mouse cells.
  • 07:57And then, when you map the reads from
  • 08:00each cell, you see how many
  • 08:03cells have a double mapping,
  • 08:06so for how many cells some of
  • 08:09your reads mapped to the human genome
  • 08:12and some of your reads mapped to
  • 08:14the mouse genome.
  • 08:15You see here in this plot the mapping of
  • 08:20the cells
  • 08:20on the human transcripts
  • 08:22and on the mouse transcripts.
  • 08:24So all these cells
  • 08:27contain only mouse material;
  • 08:29here, they contain only human material.
  • 08:31What you see here is the identification
  • 08:34of doublets,
  • 08:35because here the content is mixed:
  • 08:37you have something from mouse,
  • 08:39something from human,
  • 08:40and this is likely to be because
  • 08:43one mouse and one human cell
  • 08:45were included in the same droplet.
  • 08:47So the comparison of these two
  • 08:50plots is something to say that the
  • 08:52probability of having doublets
  • 08:54obviously depends on the concentration
  • 08:55of your cells at the beginning.
  • 08:58That's why, for example,
  • 09:00here, when you have 12.5 cells per
  • 09:03microliter, you have very few events,
  • 09:05only one doublet event.
  • 09:07When you increase the
  • 09:09concentration of cells, you probably
  • 09:11increase the efficiency of sequencing
  • 09:12your single cells, because
  • 09:14you have fewer empty droplets,
  • 09:16but you also increase the
  • 09:18probability that you have doublets,
  • 09:20as you see here.
  • 09:21So the number here increases.
  • 09:23So obviously this is possible,
  • 09:25this evaluation is possible, because
  • 09:27you are mixing two species beforehand,
  • 09:30but it's not always feasible
  • 09:32in our experiments,
  • 09:33so we need to have a way to
  • 09:35predict the possibility that a
  • 09:38cell was not really a single cell
  • 09:41but was a doublet,
  • 09:43so there are computational approaches
  • 09:45that try to evaluate, for each of the
  • 09:48cells that we obtain, the possibility that
  • 09:51it's not really a single cell
  • 09:52but a doublet.
  • 09:55So there are many,
  • 09:58many,
  • 09:58many procedures that are used,
  • 10:01and a common approach is the in
  • 10:04silico simulation of doublets.
  • 10:06This means that you have your
  • 10:10matrix with digital counts
  • 10:12with your cells.
  • 10:14You simulate a doublet by
  • 10:16selecting two
  • 10:18random cells and combining them,
  • 10:20meaning that for each of these two cells,
  • 10:24you calculate the hypothetical
  • 10:25cell that contains the sum of
  • 10:28the reads of the two cells.
  • 10:30So this is an in silico doublet,
  • 10:33so you generate thousands of these
  • 10:36in silico doublets, and the
  • 10:40procedure is to mix these doublets
  • 10:43together with the real cells
  • 10:45so that they are analyzed together.
  • 10:48So at some point of the
  • 10:50analysis, that we will see later,
  • 10:52cells can be clustered together,
  • 10:54and so for each of the original cells
  • 10:57one can see how many in silico doublets
  • 11:00are in the surroundings of the cell.
  • 11:03So for each cell I can calculate,
  • 11:06in the neighborhood,
  • 11:08how many real cells there are and how
  • 11:11many simulated doublets there are.
  • 11:13And the principle is that the ratio
  • 11:15between the simulated doublets and the
  • 11:18real cells is a score that represents
  • 11:21the probability of
  • 11:23this cell being a doublet itself.
  • 11:26So the principle is that if my cell
  • 11:29is surrounded by in silico doublets,
  • 11:31then it's likely a doublet.
  • 11:34If the
  • 11:37doublets are all far from my cell,
  • 11:39then probably the cell is not a doublet.
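A hedged sketch of this simulate-and-score idea, using scikit-learn's NearestNeighbors; everything here (counts, neighborhood size, number of simulated doublets) is illustrative rather than any published tool's exact procedure:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
cells = rng.poisson(2.0, size=(200, 50)).astype(float)  # toy cells x genes counts

# Simulate doublets by summing the counts of two randomly chosen cells.
n_sim = 400
i, j = rng.integers(0, len(cells), size=(2, n_sim))
doublets = cells[i] + cells[j]

combined = np.vstack([cells, doublets])
is_sim = np.r_[np.zeros(len(cells)), np.ones(n_sim)]    # 1 marks a simulated doublet

# Score each real cell by the fraction of simulated doublets among its neighbors.
nn = NearestNeighbors(n_neighbors=20).fit(combined)
_, idx = nn.kneighbors(cells)
doublet_score = is_sim[idx].mean(axis=1)
print("highest doublet scores:", np.round(np.sort(doublet_score)[-5:], 2))
```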
  • 11:44Was this step clear?
  • 11:48Kind of, sort of. Somehow
  • 11:49you teach it what a
  • 11:51doublet looks like,
  • 11:52and then it can find those things,
  • 11:55or you teach it what a doublet
  • 11:57looks like, and it says, OK,
  • 11:59a certain percentage should be
  • 12:01doublets? Yes, so you build the doublets
  • 12:03taking two random cells. I got that
  • 12:05part, I just don't understand how
  • 12:07that helps you identify a real one.
  • 12:11Yeah, so the idea is that, yeah,
  • 12:13yeah, it is that real doublets will
  • 12:15be surrounded by in silico doublets,
  • 12:18while real cells will be far from
  • 12:20the in silico doublets. OK, OK. I have
  • 12:23a related, maybe a related question,
  • 12:26'cause the idea of a doublet is
  • 12:28that you have genes from more than
  • 12:30one cell that are being sequenced.
  • 12:33We have this thing that happened,
  • 12:35and I'm asking in general,
  • 12:37'cause I'm assuming it would be true
  • 12:40for other people as well: when we
  • 12:42did parathyroid, the cells that make
  • 12:44parathyroid hormone have a humongous
  • 12:46amount of PTH as their, you know,
  • 12:49main transcript.
  • 12:49The ones that were negative for PTH
  • 12:52all had some PTH,
  • 12:53nothing on the order of, like,
  • 12:55let's say we had 1000 for PTH,
  • 12:57we'd have like three or one
  • 12:59or two in the cells
  • 13:01that should have been negative.
  • 13:03And it's hard to believe that
  • 13:05every cell in the parathyroid
  • 13:06actually has some RNA in it
  • 13:08for this parathyroid hormone.
  • 13:09It's much more likely that the cell
  • 13:12that looks like an endothelial
  • 13:14cell really is an endothelial
  • 13:15cell and those three little reads
  • 13:17were wrong,
  • 13:17but I don't know how that
  • 13:19would have happened.
  • 13:21Yeah, I don't know if that could
  • 13:24be also like a contamination. Uhm?
  • 13:28But if it's three instead of 3000, well,
  • 13:30that's a good signal to noise ratio,
  • 13:33I would say. I absolutely agree.
  • 13:35I just thought there was maybe some
  • 13:37general principle in single-cell
  • 13:38seq that we needed to look at,
  • 13:40but that's not the case.
  • 13:43No, the only thing coming to my
  • 13:44mind is this possibility:
  • 13:46yeah, there is some
  • 13:49possibility of supernatant
  • 13:50contamination, so that you get some
  • 13:52RNA that is in the solution.
  • 13:54For example, it could be something,
  • 13:56especially if it's abundant,
  • 13:57so it could be.
  • 13:59Thank you. Diane, maybe one other
  • 14:02explanation for your finding is
  • 14:04that those cells have some
  • 14:06illegitimate transcription going on,
  • 14:07and so, you know, that could be an explanation.
  • 14:11Yes, absolutely, but that would be real.
  • 14:13That would suggest that endothelial
  • 14:15cells in the parathyroid like to
  • 14:17turn on some parathyroid hormone,
  • 14:19which would be a little weird.
  • 14:21But the definition of illegitimate
  • 14:23transcription is expression of any
  • 14:25gene transcript in any cell type.
  • 14:26I mean, that's fine, but you know.
  • 14:30And that parathyroid tissue
  • 14:32that was sequenced is not adenoma,
  • 14:34it's normal, not abnormal, OK?
  • 14:39Diane, were those cells, like, washed
  • 14:41before they were put on the sequencer?
  • 14:43'Cause maybe
  • 14:44somehow some transcripts
  • 14:46are just leaking through
  • 14:47if there's a lot of them. They were.
  • 14:49Yeah, yeah, that gets back to maybe
  • 14:51the contamination. I don't know,
  • 14:53I thought that the machine washed the cells,
  • 14:55but I don't know specifically.
  • 14:56I'm sure that the sample kind of goes
  • 14:58through all this plumbing to get there,
  • 15:00so it's a little surprising
  • 15:01that would happen, but maybe.
  • 15:04All right, moving on.
  • 15:05I thought it was maybe something
  • 15:07we all needed to know,
  • 15:08but it seems to be a specific problem.
  • 15:10Sorry, cells were definitely
  • 15:11washed before they went on.
  • 15:17OK, moving on, but thank you.
  • 15:21Uh, OK, so the next step after this,
  • 15:24so these were to remove cells that we
  • 15:27didn't want in the following analysis;
  • 15:30the next step is the normalization.
  • 15:33So the normalization, as in any experiment,
  • 15:36has the aim of removing systematic
  • 15:39differences in the quantification
  • 15:41of genes between cells.
  • 15:42So we saw the methods that are
  • 15:45used for bulk RNA-seq.
  • 15:48So the simplest approach is
  • 15:51library size normalization, so that
  • 15:53the signal from
  • 15:56each cell is divided by the total sum
  • 15:59of the counts, of the number of reads
  • 16:02or UMIs, across all genes for each cell.
  • 16:05So this is the simplest approach:
  • 16:07normalization for the size
  • 16:08of the library of each cell.
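A minimal sketch of library size normalization: divide each cell by its total count and rescale to a common target (the 10,000 here is a conventional choice, not something from the talk):

```python
import numpy as np

counts = np.random.default_rng(2).poisson(1.0, size=(5, 100)).astype(float)

libsize = counts.sum(axis=1, keepdims=True)  # total reads/UMIs per cell
normalized = counts / libsize * 1e4          # "counts per 10,000"
print(normalized.sum(axis=1))                # every cell now sums to 10000
```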
  • 16:11The questionable assumption of
  • 16:12this approach is that you're
  • 16:14assuming that each cell should
  • 16:16have the same number of reads.
  • 16:18This is problematic.
  • 16:19It's problematic in bulk
  • 16:21RNA-seq to assume that
  • 16:24all your samples should have
  • 16:26approximately the same amount of RNA.
  • 16:29It's even more, uh,
  • 16:31questionable for single cell, because
  • 16:33we know cells, depending on the
  • 16:36cell type, can have different numbers
  • 16:39of RNA molecules, depending
  • 16:41on their
  • 16:44transcriptional activity.
  • 16:47So the alternative,
  • 16:48the main alternative that is
  • 16:50used to this simplest approach,
  • 16:52is to use spike-in RNAs.
  • 16:56Uhm,
  • 16:56there are many
  • 16:58kits of spike-in RNAs that
  • 17:00are now available,
  • 17:01and the assumption
  • 17:04when normalizing with spike-ins
  • 17:06is that inside each cell there is
  • 17:08the same amount of spike-in RNA.
  • 17:11And then the
  • 17:13common suggestion in this
  • 17:16approach is that it's better to use
  • 17:19single cell specific methods, and
  • 17:22it's better not to use the methods
  • 17:25that are commonly used for
  • 17:27bulk RNA-seq normalization.
  • 17:29The reason for this is that the bulk
  • 17:33methods do not take into consideration
  • 17:36the fact that most of the values are zeros,
  • 17:39and so using bulk RNA-seq normalization
  • 17:42methods could lead to
  • 17:45very strange size
  • 17:46factors.
  • 17:47So all these single cell specific methods
  • 17:50somehow take into consideration this
  • 17:52problem of the excessive zeros, and
  • 17:54they use different strategies to normalize.
  • 17:57So there are many methods
  • 17:59for single cell.
  • 18:00Some of them consider, instead
  • 18:02of single cells, pools of
  • 18:05cells, so that they normalize
  • 18:07not each single cell
  • 18:08but groups of cells
  • 18:11where the content is summed up,
  • 18:13and this somehow reduces the number of zeros.
  • 18:16And then another methodology tries
  • 18:19to normalize differently
  • 18:21for different groups of genes,
  • 18:23depending on whether they
  • 18:26have low,
  • 18:28medium or high expression levels.
  • 18:31Uh, so the key point here is that,
  • 18:34uh, as usual,
  • 18:36the normalization choices affect the results.
  • 18:38This is taken from a paper
  • 18:41published last year that was comparing
  • 18:43different normalization methods
  • 18:46developed for single-cell RNA-seq data.
  • 18:49So here this is a simple data
  • 18:51set of mouse
  • 18:52embryonic data where you
  • 18:54have two populations of
  • 18:57embryonic stem cells, and the cells
  • 19:01are colored according
  • 19:03to the two
  • 19:06populations they belong to.
  • 19:07So what you see is the result
  • 19:10without normalization at all,
  • 19:11and it seems to work quite fine
  • 19:14even without normalizing at
  • 19:15all. This simple normalization here is
  • 19:18the library size normalization, and also
  • 19:21this seems to be working quite fine,
  • 19:23except for this cell here. And then
  • 19:25you see six different methods that
  • 19:28were developed only for single-cell
  • 19:30RNA-seq, and they are divided into
  • 19:32two groups, based on the fact that
  • 19:36they require spike-in RNAs to work,
  • 19:38and this is BASiCS, GRM, SAMstrt, or
  • 19:41they do not require spike-in RNAs.
  • 19:44So the general message here is
  • 19:47that, depending on the method,
  • 19:49the separation of these populations
  • 19:51changes a lot between different methods,
  • 19:54so there is no method that works
  • 19:58best for every data set.
  • 20:00So I would say it's usually important
  • 20:03to try different methods, and depending
  • 20:06on whether you have spike-ins or not,
  • 20:10the possibilities are either limited or not.
  • 20:15This is another example... yes?
  • 20:16Sorry. No, it's alright.
  • 20:18Yeah, it's actually quite
  • 20:19interesting to see this result;
  • 20:21you know, it seems the simple
  • 20:24normalization is the best in this case,
  • 20:26if simply just judging from how tight
  • 20:29the same cell population is and how far
  • 20:32away two distinct populations should be.
  • 20:34But I would assume this is done by maybe
  • 20:38something like a sort of Euclidean
  • 20:40distance based measurement, because
  • 20:42if you simply normalize by library size
  • 20:44and you use a correlation,
  • 20:46that wouldn't change anything, right?
  • 20:47Because the correlation between the
  • 20:49genes will still remain the same,
  • 20:51or between cells will still remain
  • 20:52the same, regardless of whether you
  • 20:54normalized by library size or not.
  • 20:56Yeah. And here, I didn't say it,
  • 20:59so thanks for anticipating this:
  • 21:00here the visualization
  • 21:02of these clusters is based on
  • 21:05the approach of dimensionality
  • 21:06reduction that is called t-SNE.
  • 21:08So that could also affect it:
  • 21:11the differences that you see here
  • 21:13change also if you change the
  • 21:17dimensionality
  • 21:19reduction method
  • 21:22used to plot the results. But I agree,
  • 21:25here the simple normalization
  • 21:26seems to be one of the most
  • 21:29effective in terms of separating
  • 21:31the two clusters, at least.
  • 21:33This is another example with another
  • 21:35data set, of mouse lung epithelial cells.
  • 21:38So here you have more clusters of
  • 21:40cells corresponding to different
  • 21:45differentiation points, so
  • 21:47different stages of the embryo,
  • 21:49E14, E16 and E18, and
  • 21:52then the green are the adult
  • 21:55epithelial cells.
  • 21:57So also here, the basic
  • 22:01take-home message is that there is no
  • 22:05consensus on which method is best,
  • 22:08and different methods can
  • 22:10lead to different results,
  • 22:16so that, depending on the data set,
  • 22:19the method that performs best changes.
  • 22:25And what you don't have here are the
  • 22:27methods that are taken from bulk,
  • 22:29so they were not considered
  • 22:31in this comparison here.
  • 22:36OK, so this was for the preprocessing steps.
  • 22:39Then the post-processing steps:
  • 22:41after we have
  • 22:44normalized our data,
  • 22:45we can start the second part of
  • 22:48the analysis, and the main step
  • 22:50here is the dimensionality reduction.
  • 22:52So we will see that these data, since
  • 22:55they have a lot of rows and columns,
  • 22:57have a high dimensionality.
  • 23:00This is problematic for the
  • 23:02interpretation,
  • 23:04for the visualization, and also for,
  • 23:07uh, running computational
  • 23:10procedures, because they
  • 23:13can take a lot of time.
  • 23:15So the reduction to a medium
  • 23:18dimensional space is usually
  • 23:20performed on the genes, so that
  • 23:23instead of the 10,000 genes
  • 23:26that we have at this point, we have
  • 23:2910 to 30 dimensions, and we will see that
  • 23:32these dimensions can represent
  • 23:34combinations of different genes.
  • 23:36But the key point is that you reduce the
  • 23:41number of dimensions from 10,000 to 10.
  • 23:45So this is the first important step,
  • 23:48so I will speak about this quite in detail.
  • 23:52So the problem is this curse of
  • 23:54dimensionality: we have 10
  • 23:57to 20 thousand genes as features, and
  • 23:59depending on our experiment we have
  • 24:0210,000 up to 1,000,000 cells that
  • 24:05we want to analyze and consider.
  • 24:07So we need to reduce the number of features,
  • 24:11in particular the number of genes.
  • 24:14The rationale
  • 24:14has two points.
  • 24:19The first is that not all the
  • 24:21genes are important.
  • 24:23If our aim is to classify cells according
  • 24:25to their differences in expression,
  • 24:28not all genes are important.
  • 24:30So, for example, genes that are
  • 24:33never expressed are surely not important,
  • 24:35but also housekeeping genes that are
  • 24:38always expressed at the same level
  • 24:41are also not important in separating
  • 24:43the cells, and we select
  • 24:45genes for this point through
  • 24:48feature selection, i.e. gene selection.
  • 24:50Then the second point is that many
  • 24:53genes are correlated in expression,
  • 24:55so it's redundant to have two
  • 24:57genes that are highly correlated
  • 24:59as two separate pieces of information;
  • 25:02we can combine them into one dimension.
  • 25:07And this correlation is taken care of during
  • 25:10the dimensionality reduction approaches.
  • 25:12So for this selection,
  • 25:14for the first step,
  • 25:15the selection of genes that are important:
  • 25:20the aim is to select the genes that
  • 25:22contain useful information about the
  • 25:24biology of the system, and so they
  • 25:26are the genes that have differences in
  • 25:29expression between different cells,
  • 25:30and we want to remove genes that
  • 25:33contain either only noise, because
  • 25:35they have low expression levels and
  • 25:37so all the variation is noise, or the
  • 25:40genes that do not have variation among cells,
  • 25:42so the housekeeping genes.
  • 25:44And the simplest approach to do
  • 25:47that is to calculate for each gene
  • 25:50a sort of measure that is a variance
  • 25:54corrected for the mean.
  • 25:56So we have seen something similar
  • 25:58also during the lesson on bulk RNA-seq,
  • 26:02because the approach is not so different.
  • 26:05So you rank genes, you build a model:
  • 26:09each dot here is a gene,
  • 26:11and you expect the variance of the gene
  • 26:15to be proportional to the
  • 26:17average expression of the gene,
  • 26:19meaning that the more the gene
  • 26:21is expressed, the more random
  • 26:24fluctuation you also expect.
  • 26:26So you build a sort of model that
  • 26:29captures the random variation
  • 26:31that you expect in your genes,
  • 26:33and then you see which genes are outliers,
  • 26:36so they show more variance than the
  • 26:40baseline variance that is based on the
  • 26:43noise, or on the random variation
  • 26:45in expression. And those genes that
  • 26:48are highly variable are the ones that
  • 26:50you select for further analysis,
  • 26:52because those are the genes where you
  • 26:55don't have only technical variation
  • 26:57but also biological variation.
  • 27:00The questionable assumption here is
  • 27:02that the biological variability is
  • 27:05higher than the technical variability,
  • 27:07because the assumption here is that
  • 27:10all these outlier genes that show
  • 27:13higher variance than the average are
  • 27:15important because this higher variance
  • 27:18is biological variance. And obviously,
  • 27:20here as in some bulk RNA-seq approaches,
  • 27:24you could have some methods that
  • 27:26penalize genes having high variance
  • 27:29but low mean,
  • 27:30because you don't trust them so much.
  • 27:32But the assumption is that you
  • 27:34calculate a measure of variance, you
  • 27:37consider the top variable genes,
  • 27:39and you remove the others from the analysis.
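A rough sketch of this mean-variance selection; the log-polynomial trend is a stand-in for the fitted baseline (real tools use more careful fits), and all numbers are simulated:

```python
import numpy as np

rng = np.random.default_rng(3)
rates = rng.gamma(2.0, 1.0, size=(1, 2000))                # per-gene expression rates
expr = rng.poisson(rates, size=(300, 2000)).astype(float)  # toy cells x genes matrix

mean = expr.mean(axis=0)
var = expr.var(axis=0)

# Fit a baseline variance-vs-mean trend, then keep the genes whose variance
# exceeds it the most: the putative biologically variable genes.
coef = np.polyfit(np.log1p(mean), np.log1p(var), deg=2)
expected = np.expm1(np.polyval(coef, np.log1p(mean)))
excess = var - expected
hvg = np.argsort(excess)[-500:]                            # top 500 variable genes
print("selected", len(hvg), "highly variable genes")
```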
  • 27:45Then there is the dimensionality reduction,
  • 27:47so this is a family of approaches
  • 27:49that are used on complex data
  • 27:51to reduce the number of dimensions of
  • 27:55the data. This has a double purpose,
  • 27:58as I said: to help the
  • 28:02downstream analysis, because
  • 28:04reducing the dimensions speeds up the
  • 28:06calculation times, and also to
  • 28:09help the visualization.
  • 28:11Especially when you report single
  • 28:13cell data, the need is to show data
  • 28:17in a simple and interpretable output.
  • 28:19So usually this is a 2D plot.
  • 28:22And so dimensionality reduction
  • 28:24methods are also used in order to
  • 28:27compress high dimensional information
  • 28:29so that it can be presented in a 2D
  • 28:32plot. And for these two different needs
  • 28:35there are multiple methodologies;
  • 28:36each one has different advantages
  • 28:38and limitations,
  • 28:39so the classic example of a dimensionality
  • 28:42reduction that we always have in mind,
  • 28:45and, possibly, historically speaking,
  • 28:47one of the oldest, is the
  • 28:51problem of drawing a 2D map of the Earth.
  • 28:55So the Earth is 3D and you want a
  • 28:572D map that keeps most of the
  • 29:01reliable information on the
  • 29:03geography of the Earth:
  • 29:05where the continents are placed,
  • 29:08their shapes, their areas and so on.
  • 29:11So there are
  • 29:13many different approaches to convert
  • 29:15the 3D surface of the Earth into 2D maps,
  • 29:19and for example here you see one
  • 29:21of the most famous projections, which
  • 29:24is called the Mercator projection.
  • 29:26This was developed in the 16th century,
  • 29:29and it's the one used by sailors because
  • 29:32it keeps the directions and shapes.
  • 29:36So it's a good map to know
  • 29:38what is north, east or west.
  • 29:40The problem is that there is a high
  • 29:43distortion of areas in this map, so that
  • 29:45the farther you are from the equator,
  • 29:48the more areas seem
  • 29:50larger than they are.
  • 29:51And so, for example, here it seems
  • 29:53that Greenland is bigger than
  • 29:55the whole of South America,
  • 29:57which is not so. This is the distortion.
  • 29:59Other projections, such as these two,
  • 30:03are projections where
  • 30:05the area is preserved,
  • 30:07so that this area really corresponds
  • 30:09to the actual smaller area with respect,
  • 30:12for example, to South America.
  • 30:14But these kinds of maps do not
  • 30:16preserve shapes and directions,
  • 30:18so the common point is: any
  • 30:21projection will distort
  • 30:24some of the features.
  • 30:26So reduction of dimensionality is
  • 30:28always an approximation, and it
  • 30:31brings some distortions and deviations.
  • 30:33And as for the Earth map,
  • 30:35where we have different approaches, also
  • 30:37for our single cell data we will see
  • 30:40there are different techniques.
  • 30:44Is this clear?
  • 30:49Anyway, it's a very good analogy,
  • 30:50so this is awesome feedback.
  • 30:54So the first one that we will see, with a real
  • 30:57example, is principal component analysis.
  • 30:59So in our case we are studying cells
  • 31:01based on the expression of genes.
  • 31:03So in our simple example
  • 31:05we will have six cells, and since
  • 31:07they are simple cells, they have,
  • 31:09they express,
  • 31:11only four genes.
  • 31:12And so what you see here is the
  • 31:15expression level of each gene,
  • 31:17from A to D, in these six cells.
  • 31:20So now we can use the expression levels,
  • 31:23uh, as a way to map cells,
  • 31:26and the expression level of each
  • 31:29gene is a different dimension.
  • 31:31So in this case we have a four
  • 31:35dimensional space that obviously
  • 31:36we cannot plot on a 2D plot.
  • 31:39So, simply, we could plot
  • 31:41cells on a 2D plot based on
  • 31:44the expression of two genes,
  • 31:47and so we can take gene A and gene B
  • 31:50and build this sort of map of
  • 31:53these cells based on the expression
  • 31:55level of gene A, which is our X
  • 31:58axis, and gene B, which is our Y axis.
  • 32:01And here you see where cells are
  • 32:04located according to the expression
  • 32:06of these two genes.
  • 32:09So the expression of each gene
  • 32:11is a dimension.
  • 32:13So now, with this, we can plot
  • 32:15two genes in a 2D map,
  • 32:18and so, for performing
  • 32:20principal component analysis,
  • 32:21what is usually done at the beginning
  • 32:24is to center the measurements,
  • 32:26meaning that this gene here
  • 32:29has an average expression
  • 32:31of seven:
  • 32:33this is the average of gene A
  • 32:36across the six cells.
  • 32:38And so on:
  • 32:39gene B has an average
  • 32:42of 4.5, gene C of six, and so on.
  • 32:44So centering the data means that
  • 32:47you calculate the mean expression
  • 32:49of the gene across all these cells
  • 32:51and you subtract the mean from all
  • 32:54the values of the gene, so that you
  • 32:57switch from this matrix that is
  • 32:59not centered to this matrix that
  • 33:02is centered around 0.
  • 33:03So I simply, from the top
  • 33:05row, subtracted seven,
  • 33:07so 11 minus 7 is 4, and so on.
  • 33:10From the second row I subtract 4.5, and
  • 33:12so on, so that, as you see, the centered
  • 33:16values are also negative, and the common
  • 33:18point is that the mean for each gene is 0.
  • 33:23So usually, before performing
  • 33:25the dimensionality
  • 33:26reduction,
  • 33:27this centering is performed, and
  • 33:29it's also helpful in the visualization.
  • 33:32So before centering,
  • 33:33the cells were looking like this.
  • 33:35After the centering,
  • 33:37these are the new coordinates,
  • 33:38so nothing changed; it's only the
  • 33:41origin of the axes and the position of
  • 33:43the zero that are different.
  • 33:46But if you look at the cells,
  • 33:48the points are exactly in the
  • 33:51same position as before.
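A small sketch of the centering step; the matrix is made up to match the averages quoted in the talk (7, 4.5 and 6 for genes A to C; gene D's values are invented), not the slide's exact data:

```python
import numpy as np

expr = np.array([[11.0, 12.0, 9.0,  6.0, 2.0, 2.0],   # gene A, mean 7
                 [ 6.0,  7.0, 5.0,  3.0, 3.0, 3.0],   # gene B, mean 4.5
                 [ 8.0,  9.0, 7.0,  5.0, 3.0, 4.0],   # gene C, mean 6
                 [ 2.0,  8.0, 3.0, 10.0, 9.0, 1.0]])  # gene D, illustrative

centered = expr - expr.mean(axis=1, keepdims=True)    # subtract each gene's mean
print(centered.mean(axis=1))                          # ~0 for every gene
```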
  • 33:53Now, one question that we can
  • 33:55ask here is whether...
  • 33:57So what you see here is the
  • 34:00difference between the cells:
  • 34:02we can capture the difference of
  • 34:04these cells because they differ
  • 34:06in the expression of A and B, and one
  • 34:09question we can ask is whether
  • 34:11gene A or gene B
  • 34:14is better in separating these cells.
  • 34:16So this corresponds to asking
  • 34:18how much of the variability of
  • 34:21the data is associated with the
  • 34:23expression of gene A or with
  • 34:25the expression of gene B.
  • 34:28And so the question is,
  • 34:30what is the variation of these six
  • 34:33points that is associated with gene A
  • 34:37expression and gene B expression?
  • 34:39So there is a simple way to calculate
  • 34:42the variation associated, which
  • 34:44corresponds to the formula of the variance.
  • 34:48So this is an example to calculate the
  • 34:51variation that is associated with gene A.
  • 34:54So here we're considering the X axis,
  • 34:58so I can draw a projection from each
  • 35:01cell to this axis and calculate the
  • 35:05distance from the origin to each cell.
  • 35:08And basically,
  • 35:09since here we centered the data,
  • 35:12the distance basically corresponds
  • 35:13to the expression level.
  • 35:15So cell one has a distance of four, cell
  • 35:18two a distance of five, and so on.
  • 35:22Now, if we want to measure the
  • 35:25variation with the variance formula:
  • 35:27the variance formula is to take
  • 35:29the square of each
  • 35:32of these six distances,
  • 35:34sum the squares, and then divide everything
  • 35:37by the number of observations minus one.
  • 35:40So this is how we calculate the
  • 35:43variance of the expression of gene A.
  • 35:45So the formula here is the following:
  • 35:48we take the six distances,
  • 35:51we square the distances,
  • 35:52we sum the results and we divide by 5.
  • 35:56So the variance of gene A is 30.8.
  • 36:00We can do the same for gene B,
  • 36:03in order to have the variance
  • 36:06associated with gene B. Now, looking
  • 36:09at this plot,
  • 36:11it seems by eye that gene A has
  • 36:14more differences,
  • 36:14higher variance, than gene B,
  • 36:16and you can see this also by
  • 36:19looking at the ranges of the axes:
  • 36:22minus 6 to 6 versus minus 4 to 4.
  • 36:26So the variance of gene A is 30.8;
  • 36:30in this case,
  • 36:31the variance of gene B is less, in
  • 36:34this case 8.3.
  • 36:37So the calculation of the
  • 36:39variance of gene B is the same,
  • 36:41but instead of projecting
  • 36:43cells on the X axis,
  • 36:44we project cells on the Y axis, and
  • 36:47that's how I came up with these results.
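The same arithmetic in a few lines. The first two gene-A distances (4 and 5) are the ones quoted; the remaining values are chosen so the variances reproduce the quoted 30.8 and 8.3, and the slide's actual numbers may differ:

```python
import numpy as np

gene_a = np.array([4.0, 5.0, -8.0, -6.0, 2.0, 3.0])    # centered distances on X
gene_b = np.array([4.5, 2.5, -2.5, -2.5, -1.5, -0.5])  # centered distances on Y

var_a = (gene_a ** 2).sum() / (len(gene_a) - 1)        # sum of squares / (n - 1)
var_b = (gene_b ** 2).sum() / (len(gene_b) - 1)
print(var_a, var_b)                                    # 30.8 and 8.3
print(var_a / (var_a + var_b))                         # ~0.79: gene A holds ~80%
```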
  • 36:50Now we can see that if we consider
  • 36:52the global variance of our data
  • 36:54along these two dimensions,
  • 36:56we can say
  • 36:59that the expression of gene A
  • 37:02contains 80% of the global variance
  • 37:05and the expression of gene B
  • 37:08contains approximately 20% of the
  • 37:11whole variance, where the
  • 37:15whole variance is just 30.8 + 8.3.
  • 37:18So if now I have to select only one
  • 37:21of these dimensions, based on the
  • 37:24fact that variation is information,
  • 37:26I would select gene A.
  • 37:27So if I have to drop one of the
  • 37:30genes, I would drop gene B, because
  • 37:32it contains less information,
  • 37:35less variance, than gene A.
  • 37:38Now the question for PCA is
  • 37:41whether there is a
  • 37:44line that is not gene A or gene B,
  • 37:47not one of these, that captures
  • 37:51more variation, that maximizes
  • 37:53the variation that is captured.
  • 37:56So the question is to try to calculate
  • 38:00the variance that is associated
  • 38:02with each of these possible lines,
  • 38:05in the same way as we did here,
  • 38:08but changing the line and
  • 38:11so changing this calculation.
  • 38:13So this is a problem of, like, minimization
  • 38:17of the distances, or maximization
  • 38:21of the variance.
  • 38:23And so we can find,
  • 38:26among all the possibilities,
  • 38:27the line that maximizes the
  • 38:29variance for our data.
  • 38:31In this case, this is the line
  • 38:35that maximizes the variance,
  • 38:37and basically what we found is
  • 38:40principal component one of our data.
  • 38:43So principal component
  • 38:45one is exactly
  • 38:48the dimension that maximizes
  • 38:50the variance of the data with respect
  • 38:52to all the other possibilities,
  • 38:55all the other possible lines,
  • 38:57in this case those that cross the origin.
  • 39:02Now, once we identify PC one: PC two,
  • 39:05the second principal component,
  • 39:06is the line that is orthogonal to
  • 39:09the first one, and this is easy
  • 39:11because we are in a case where
  • 39:13we have only two dimensions, so
  • 39:16the second principal component
  • 39:18is simply the line that is
  • 39:20orthogonal to the principal
  • 39:23component one that we found.
  • 39:24So once we identify these principal components,
  • 39:27now we can represent our data
  • 39:29not from the point of view of
  • 39:32the expression
  • 39:34of our original genes,
  • 39:35but from the point of view of
  • 39:38principal component one and
  • 39:40principal component two.
  • 39:41So this means that we are rotating the data
  • 39:45in this way,
  • 39:46so that now our
  • 39:49system of reference is not given by
  • 39:52our original expression but by PC1 and PC2.
  • 39:57But the data are always the same:
  • 40:00they didn't change their respective
  • 40:02localization, so we just rotated the data.
  • 40:05Now the advantage of doing this is
  • 40:08that if we calculate the variance
  • 40:12associated with PC one and PC two,
  • 40:15we can see a difference with respect
  • 40:19to our original dimensions.
  • 40:22So we can see that PC one captures almost
  • 40:27100% of the variance of our data,
  • 40:30while PC two captures much less.
  • 40:33And this is
  • 40:35exactly because PC one was
  • 40:38selected because it was
  • 40:40maximizing this value here.
  • 40:43So here you see the difference
  • 40:45between the variance with the
  • 40:47original dimensions, gene A and gene B,
  • 40:49and with the new principal components.
  • 40:51So the advantage of the technique is that
  • 40:53now I can drop one of the dimensions:
  • 40:57if we want to pass from 2
  • 40:59dimensions to one dimension,
  • 41:01if I select PC one,
  • 41:02I lose less than 5% of the information,
  • 41:05while with the original gene A and gene B,
  • 41:08if I chose gene A, I had to lose
  • 41:1120% of the information. In this way
  • 41:13I reduce the dimensions:
  • 41:14I can reduce from 2
  • 41:17dimensions to one while keeping almost
  • 41:20all of the information of the data.
  • 41:22And this is the trick used by
  • 41:26principal component analysis.
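An end-to-end sketch of that trick with scikit-learn, on a made-up 6-cells-by-4-genes matrix (illustrative, not the slide data); explained_variance_ratio_ gives the per-component variance shares discussed above, and components_ the gene loadings discussed below:

```python
import numpy as np
from sklearn.decomposition import PCA

# Rows are cells, columns are genes A-D; A-C co-vary, D behaves differently.
expr = np.array([[11.0, 6.0, 8.0,  2.0],
                 [12.0, 7.0, 9.0,  8.0],
                 [ 9.0, 5.0, 7.0,  3.0],
                 [ 6.0, 3.0, 5.0, 10.0],
                 [ 2.0, 3.0, 3.0,  9.0],
                 [ 2.0, 3.0, 4.0,  1.0]])

pca = PCA(n_components=2)             # PCA centers the data internally
coords = pca.fit_transform(expr)      # the cells in PC1/PC2 coordinates
print(pca.explained_variance_ratio_)  # share of total variance per component
print(pca.components_)                # loadings: each gene's weight in each PC
```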
  • 41:27Ah, so,
  • 41:28this is a more complex example,
  • 41:31an example with four dimensions.
  • 41:33If you remember,
  • 41:35our original table was with four genes,
  • 41:37so we can do the same with four genes,
  • 41:40with four dimensions. We can calculate
  • 41:43the original variance associated
  • 41:44with each of the original genes,
  • 41:46so genes A, B, C and D, expressed as
  • 41:48a percentage of the entire variance.
  • 41:51And again,
  • 41:51if I had to choose the two genes
  • 41:54containing most of the variance,
  • 41:56I would choose gene A and gene C.
  • 41:58But still I would lose the
  • 42:0010% of the variance associated with
  • 42:03gene B and the 20% associated with gene D.
  • 42:06Instead, if I perform the principal
  • 42:08component transformation,
  • 42:09I find four principal
  • 42:11components, in a way that the first
  • 42:14maximizes the explained variance,
  • 42:17the second is orthogonal to the first
  • 42:20and maximizes the
  • 42:21residual variance, and so on.
  • 42:24So the advantage is that now, if I consider
  • 42:27these first two components and I remove
  • 42:30these other two, I only lose like 3 to 4%
  • 42:34of the variance and I can keep more than 90%,
  • 42:39while here I could keep
  • 42:41only 70% of the variance.
  • 42:44And if I consider only these two
  • 42:47dimensions and I plot my data,
  • 42:50my cells, here, I can obtain this plot.
  • 42:54So these are the original cells
  • 42:56based on the expression of these
  • 42:59four genes, plotted in the first two
  • 43:02principal components, where dimension
  • 43:04one explains 74% of the variance
  • 43:07and dimension two explains 23%.
  • 43:09This corresponds to these values here.
  • 43:14And the advantage of PCA, so,
  • 43:16again:
  • 43:16the trick was to reduce the space
  • 43:19from four to two dimensions,
  • 43:21but keeping most of the information.
  • 43:24And so the new dimensions,
  • 43:26PC one and PC two, are
  • 43:29linear combinations of
  • 43:31the old dimensions, and the advantage
  • 43:33of PCA is that I can easily calculate
  • 43:36how much the expression of the original
  • 43:39genes is important in each of the
  • 43:41newly found principal components.
  • 43:43For example, in a plot like this.
  • 43:46And this is a plot that shows
  • 43:50that principal component one captures
  • 43:52a lot of the expression of genes A,
  • 43:55B and C, while gene D is not very
  • 43:59important in principal component
  • 44:01one, while principal component two
  • 44:04is mainly capturing
  • 44:06the expression of gene D.
  • 44:10And in this example,
  • 44:11an explanation of this is, if you look
  • 44:14at the original values, that genes
  • 44:16A, B and C are very highly correlated,
  • 44:18so they're highly expressed in
  • 44:20the first three cells and
  • 44:22they have low expression in the
  • 44:25fourth to the sixth cells,
  • 44:27while gene D is a little bit
  • 44:29different, because the gene is highly
  • 44:31expressed in cells 2, 4 and 5 and has
  • 44:34low expression in the others.
  • 44:36So this means that gene D is not correlated
  • 44:40with the expression of the other genes,
  • 44:42so that's why, using PCA, I can capture
  • 44:45the correlated expression of these three
  • 44:47genes in the first principal component,
  • 44:50and the
  • 44:52expression of gene D, which is
  • 44:56different and not correlated with the others,
  • 44:59using the second component,
  • 45:01the second dimension.
  • 45:03Obviously, in the real case scenario,
  • 45:06if we start
  • 45:07from 3000 genes,
  • 45:10we start from 3000 dimensions.
  • 45:13But if you look at PCA
  • 45:16plots, sometimes you can,
  • 45:19you can always find also
  • 45:21the percentage of variance
  • 45:23that is explained by each dimension,
  • 45:26so you can see how much of the entire
  • 45:30information of the data can be
  • 45:33explained using only two dimensions
  • 45:35and how much you are missing.
  • 45:40Uh, now, PCA, uh,
  • 45:42was worth explaining because it's
  • 45:44still one of the most used techniques,
  • 45:47also in single cell data analysis.
  • 45:49But you don't see PCA often in the
  • 45:52visualization of single cell data,
  • 45:54and that's because principal
  • 45:56component analysis, as I said,
  • 45:58has the advantage of being highly
  • 46:01interpretable, because from the
  • 46:03components I can go back quite
  • 46:05easily to the original genes,
  • 46:07so I can establish which genes
  • 46:11are important in each of the dimensions.
  • 46:13It is computationally efficient,
  • 46:15but when I want to visualize
  • 46:18single-cell RNA-seq data,
  • 46:22it's not very
  • 46:24appealing to the eye.
  • 46:25And the reason for this is, again,
  • 46:28that the data in single cell are nonlinear;
  • 46:31they have an excess of zeros,
  • 46:33and so if you plot the
  • 46:35first two principal components,
  • 46:37often you don't have a clear
  • 46:39separation of cells,
  • 46:40and that's what you want to show,
  • 46:43especially if you're generating a
  • 46:47figure that is going to represent your data.
  • 46:50So for this reason,
  • 46:52mainly for the visualization,
  • 46:53not for the analysis of the data,
  • 46:56in the first years of single
  • 46:59cell analysis the most employed
  • 47:02approach was called t-SNE,
  • 47:04that is, t-distributed stochastic
  • 47:06neighbor embedding.
  • 47:07So this approach is not linear
  • 47:10as principal component analysis is;
  • 47:12it's based on graph methods.
  • 47:13On this I will not spend
  • 47:17a lot of time explaining how it works,
  • 47:20but basically it's a random procedure, and
  • 47:24being nonlinear means that it transforms
  • 47:28the original data using
  • 47:30nonlinear equations, and
  • 47:32the advantage is that it's better
  • 47:34at showing clusters of cells,
  • 47:36that is,
  • 47:37it's able to retain the local
  • 47:40structure of the data in low dimensions,
  • 47:43where the local
  • 47:45structure means clusters of cells
  • 47:47that are very similar to each other.
  • 47:50The disadvantage is that it's a
  • 47:53stochastic method, so each iteration
  • 47:56can produce a different result;
  • 47:58that's not true for PCA.
  • 47:59It takes a long time to run,
  • 48:01especially when you increase the number
  • 48:03of cells, and it's considered to be bad
  • 48:06at keeping the global structure of the data.
  • 48:10And I have an example of t-SNE:
  • 48:13so this is a data set with, uh,
  • 48:16I think it's bulk RNA-seq samples
  • 48:18from different cancers.
  • 48:20So each color here is a sample
  • 48:23from a different cancer type, and
  • 48:26then the same data were run twice
  • 48:29with the t-SNE approach, and these
  • 48:31are the two outputs, so you can see
  • 48:35that something is conserved
  • 48:37between the two runs.
  • 48:38Uh, so the number of clusters,
  • 48:42the size of the clusters, and probably
  • 48:45the assignment of each sample to
  • 48:47each cluster have been conserved,
  • 48:49and also the shapes of the
  • 48:51single clusters are somehow close.
  • 48:53But if you look at the organization of
  • 48:56the whole set of clusters,
  • 48:59that is different.
  • 49:00For example,
  • 49:00this orange cluster here in
  • 49:02run one is in the
  • 49:04middle, and while here green
  • 49:07is opposite to red, there
  • 49:09red and green are very near to each other.
  • 49:13So for capturing the clusters,
  • 49:16visualizing the clusters,
  • 49:17this method is good.
  • 49:19But if I start interpreting
  • 49:21the distances between different clusters,
  • 49:24this method is not valuable anymore,
  • 49:26so it's not reliable, because depending
  • 49:29on the initial random step of the analysis,
  • 49:32it could lead to different maps,
  • 49:34and that's the main reason.
  • 49:36Yeah, I'm sorry, yeah.
  • 49:39I was just wondering so like is there
  • 49:41any value in like sort of running the
  • 49:44program like a bunch of times, right?
  • 49:46Like just iteratively and then
  • 49:47taking the average of the distances?
  • 49:49Is there some truth that emerges there
  • 49:51when you, like, repeat it a whole bunch
  • 49:53of times, or is it just not useful?
  • 49:57Uhm, I don't think I can answer
  • 50:00that personally, so I wouldn't
  • 50:03know the answer to this question.
  • 50:07Uhm, I don't know if anyone tried.
  • 50:10There is a way to reproduce
  • 50:12the analysis by performing a so-called
  • 50:14pseudorandom analysis,
  • 50:16meaning that the random analysis,
  • 50:17when you run a program, is based on
  • 50:20a seed that is a random number,
  • 50:22but it can be kept,
  • 50:24it can be remembered across the different
  • 50:26iterations, and if you keep the seed
  • 50:29constant you can reproduce results.
  • 50:31But that's not...
  • 50:32so that's a way to keep
  • 50:34the program consistent if you run
  • 50:36the program on
  • 50:38different machines, for example,
  • 50:40but I don't know if that,
  • 50:45if this has been done;
  • 50:46maybe so, but I don't know what
  • 50:48would be the result of
  • 50:50running it a lot of times
  • 50:52and trying to capture a sort
  • 50:53of stability in the distances.
  • 50:55Sure, sure.
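A small illustration of that seed mechanism, using scikit-learn's TSNE as a stand-in implementation: with the same random_state the stochastic embedding is reproducible (on the same machine and library version):

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.default_rng(4).normal(size=(100, 20))   # toy high-dimensional data

emb1 = TSNE(n_components=2, random_state=42).fit_transform(X)
emb2 = TSNE(n_components=2, random_state=42).fit_transform(X)
print(np.allclose(emb1, emb2))   # True: same seed, same embedding
```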
  • 50:59So in the field, the fact is that, for example,
  • 51:02if you look at the publications, you can
  • 51:05like date the analysis according
  • 51:08to the method that they used to visualize.
  • 51:11So if you look at some plot and
  • 51:13it's t-SNE, probably the analysis
  • 51:16is before 2018, because in
  • 51:192018 what is now the
  • 51:22most used approach to visualize
  • 51:24single cell data was
  • 51:27presented, and that's UMAP,
  • 51:29the UMAP method.
  • 51:30So if you see a plot that uses
  • 51:34UMAP as a dimensionality reduction
  • 51:36technique, it dates from
  • 51:382018 to now, more or less.
  • 51:42So about UMAP, also, another
  • 51:44question, also common, sort of, uh:
  • 51:47so I've actually noticed that if you,
  • 51:50for example with t-SNE,
  • 51:51if you just have completely random data,
  • 51:54so let's say, OK, I generate a
  • 51:58computer generated, completely random data set,
  • 52:01supposedly it will be a fuzzy
  • 52:03ball in the PC plot,
  • 52:04but if you do the t-SNE it will
  • 52:07become some kind of patterns you can
  • 52:09start to see emerging from that.
  • 52:11I just wonder, for things like
  • 52:13UMAP and other things, is
  • 52:15this also the same problem, or...
  • 52:17Yeah, it's the same problem,
  • 52:20because all
  • 52:22these methods, like, try to maximize
  • 52:25the separation of these objects, and
  • 52:27the problem is that both for t-SNE
  • 52:30and also for UMAP you have noise.
  • 52:33So if your differences are mainly driven
  • 52:35by noise, um, they,
  • 52:37they basically create patterns from noise,
  • 52:39yeah. And this is less of a problem
  • 52:43with PCA. That's right, yeah.
  • 52:45So this is not solved by UMAP.
  • 52:48So what seems to be solved by
  • 52:50UMAP is mainly that it's faster,
  • 52:52uh, it's faster than t-SNE,
  • 52:55so it can be applied in a reasonable time
  • 52:58when the data set is very big in
  • 53:01terms of number of cells, and also it
  • 53:04seems to be better at preserving
  • 53:07the global structure of the data, so
  • 53:10that also the distances between the
  • 53:12different clusters are more reliable.
  • 53:15I think there is actually a parameter
  • 53:17where you kind of tune how much
  • 53:21weight you give to the
  • 53:23local structure or to the global structure,
  • 53:25but it's generally considered to be more
  • 53:28reliable on the global structure of the data,
  • 53:31so it's considered to be like a
  • 53:33good tradeoff between
  • 53:36the PCA and the t-SNE approaches.
  • 53:39But, uh,
  • 53:40both these methods have problems,
  • 53:42for example in interpretability.
  • 53:44With PCA it is easy to go back to the
  • 53:48original genes; with t-SNE and
  • 53:50UMAP it's very problematic.
  • 53:53And, uhm,
  • 53:53yeah.
  • 53:54Also, UMAP is random,
  • 53:56so different runs
  • 53:58give you slightly different results.
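A hedged sketch with the umap-learn package (assumed installed); n_neighbors is commonly described as the knob trading off local against global structure, which may be the parameter meant above, and fixing random_state again makes the stochastic result repeatable:

```python
import numpy as np
import umap  # the umap-learn package

X = np.random.default_rng(5).normal(size=(200, 30))

local_view = umap.UMAP(n_neighbors=5, random_state=0).fit_transform(X)    # favors local detail
global_view = umap.UMAP(n_neighbors=50, random_state=0).fit_transform(X)  # favors global layout
print(local_view.shape, global_view.shape)  # (200, 2) (200, 2)
```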
  • 54:02Can I ask you another quick question?
  • 54:05So when you say the difference in
  • 54:07time for, like, processing the data,
  • 54:10what is the scale of that time?
  • 54:12Are you saying like hours or days?
  • 54:16Ah well, it's,
  • 54:18uh, well, it varies
  • 54:20depending on the number of,
  • 54:21uh, cells, so it could be
  • 54:24that if you have 100 cells,
  • 54:26you don't notice the difference,
  • 54:28but it scales: adding data,
  • 54:32it grows a lot. For t-SNE
  • 54:34I know that there are a lot of
  • 54:36variations that have
  • 54:38been working on the efficiency,
  • 54:39so they are faster. Uh, but, uh,
  • 54:42I guess it's also a problem of memory.
  • 54:46Personally, I never ran an analysis
  • 54:48on a sample that was more than
  • 54:5020 or 30 thousand cells, and so
  • 54:53personally I don't know how problematic
  • 54:55it is to work with t-SNE on a
  • 54:58large data set of 1,000,000 cells.
  • 55:00But the point is that the
  • 55:02more cells you have,
  • 55:03the more you gain in time
  • 55:05using UMAP against at least the basic t-SNE.
  • 55:09Sure. OK, so this is only technical,
  • 55:13it's not really...
  • 55:14So the key point also is that, depending
  • 55:18on the dimensionality reduction choice you make,
  • 55:22the results look different.
  • 55:24So this is the same data set,
  • 55:27so the normalization is the same,
  • 55:29the input data was the same;
  • 55:31it's from a mouse brain.
  • 55:33So here you see some populations
  • 55:35that correspond to neurons,
  • 55:37different types of neurons, and microglia
  • 55:39and other cells.
  • 55:40And you see the representation of this
  • 55:42dataset using principal component analysis.
  • 55:44So cells here are colored according
  • 55:47to the cell type, and you
  • 55:50can see the classes,
  • 55:52but, for example,
  • 55:54points within the same clusters
  • 55:56kind of spread around,
  • 55:58so that's why, for the visualization
  • 56:00of the clusters, this is less
  • 56:03clear than the other two methods.
  • 56:05These two methods basically try to
  • 56:08maximize the compactness of
  • 56:10the data inside the cluster,
  • 56:12using two different approaches.
  • 56:19And then again, for interpreting these data,
  • 56:22it's more about just seeing,
  • 56:23like, how cells are similar to
  • 56:25each other within a cluster,
  • 56:27like how they cluster separately,
  • 56:29as opposed to, like, the distances
  • 56:31between the clusters, being able
  • 56:32to infer any relationship from
  • 56:34that distance, right? Well,
  • 56:36here you could be interested also; well,
  • 56:38for sure, if you use t-SNE,
  • 56:40the distances between them, as you see here,
  • 56:43the distances tend to be
  • 56:46more or less the same,
  • 56:48so they're equally distributed, but
  • 56:50here in UMAP you can have
  • 56:54distances that have a
  • 56:56range, so low to high distances.
  • 56:59So here it can be informative.
  • 57:01But basically, when you use
  • 57:03this for the visualization,
  • 57:05what you want to communicate is
  • 57:08that you identified the clusters of
  • 57:11different cells, and you want to be
  • 57:14able to see where they are,
  • 57:17in which relation they are,
  • 57:18how many cells belong to each cluster,
  • 57:20and so on.
  • 57:21So usually these are then annotated
  • 57:23with the name of the cluster, based
  • 57:26on marker genes, and so obviously
  • 57:28these, for the visualization, are
  • 57:29better if you want to label
  • 57:32your clusters.
  • 57:33But PCA is still the common tool,
  • 57:36one of the most common tools that
  • 57:38are run in the downstream analysis,
  • 57:40meaning that t-SNE and UMAP are
  • 57:42really used mainly for visualization
  • 57:43of data but nothing else,
  • 57:45especially t-SNE, while PCA
  • 57:47is the basis for the clustering and
  • 57:50for the trajectory analysis and so on.
  • 57:52So still, all the pipelines, many of the
  • 57:55available pipelines, are still using PCA;
  • 57:58it's just for the visualization that
  • 58:00they use these alternative approaches.