Analysis and Interpretation of Single-Cell Sequencing Data – Part 2
August 25, 2021
Information
- ID
- 6877
- To Cite
- DCA Citation Guide
Transcript
- 00:00Yeah.
- 00:03OK, so today is the second part of our journey through the analysis of single cell RNA-seq data processing.
- 00:11Last time we started by defining how single cell RNA sequencing works and the differences between different protocols,
- 00:20for example, coverage on genes, how they isolate cells and so on.
- 00:25Today we deal more with the actual analysis of the data.
- 00:31So last time we arrived at the point where we saw these starting steps.
- 00:37We saw that, from the molecular point of view, the strategy is to link the original RNA molecule with an oligonucleotide called the cell barcode, which allows us to identify the cell of origin of the RNA,
- 00:52and then another important part is the UMI, the Unique Molecular Identifier, a random nucleotide sequence that allows us to correct for amplification biases,
- 01:03so to keep only those duplicate reads that belong to different molecules in our cells, and not those amplified during the PCR.
- 01:15So after these steps, and after the mapping, which we also covered last time... sorry, a question? Yes.
- 01:24How can you have the same UMI in two different RNAs?
- 01:30Oh, I see. If you have the same UMI, you collapse the reads.
- 01:34So if the read is the same, the UMI is the same, and the cell barcode is the same, you collapse the read.
- 01:43I see, I see. So you could have the same cell barcode, the same UMI, but a different sequence, because you're in a different part of the same RNA.
- 01:55Well, in theory that depends on the protocol, because some of them are only 3' end, and so this is... I'm looking at numbers five and six there.
- 02:06Five and six, the reads you mean? Yeah?
- 02:09Well, yes, in theory. So these would be a different RNA, could be a different gene, but randomly they have the same UMI, yeah. So in theory it can happen.
- 02:24It depends on the length of the UMI, because they are randomly generated.
- 02:28For example, if they are 12 nucleotides long, the probability to have two that are identical is one in 4 to the 12th, so the longer they are, the lower the probability to have two UMIs with the same sequence.
- 02:46OK, yeah.
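As a quick back-of-the-envelope sketch of that collision probability in Python (the molecule count below is hypothetical, purely to illustrate the 4 to the 12th argument; real collision rates also depend on sequencing errors):

```python
import math

# Probability that two randomly generated UMIs of length L are identical,
# assuming each base is drawn uniformly and independently.
umi_length = 12
n_umis = 4 ** umi_length               # ~16.8 million possible 12-mers
p_pair = 1 / n_umis                    # chance two specific UMIs collide

# Birthday-problem estimate: chance of ANY collision among m molecules
# tagged with the same gene and cell barcode (m = 1000 is hypothetical).
m = 1000
p_any = 1 - math.exp(-m * (m - 1) / (2 * n_umis))
print(f"{n_umis:,} UMIs; pairwise p = {p_pair:.1e}; any-collision p = {p_any:.3f}")
```

With 12-nucleotide UMIs, even a thousand molecules of the same transcript in the same cell collide only a few percent of the time, which is why longer UMIs make the collapsing step safer.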
- 02:49Uhm, OK. So, to recap, UMIs are a strategy used to reduce amplification biases, in order to correct for them.
- 02:56And in single cell they are important because of the low amount of material we start with, that is, the RNA content of a single cell, and also because of the elevated number of amplification cycles that are necessary to amplify the signal.
- 03:12So after the mapping of the reads we arrive at this gene expression matrix, where each column represents one of the cells of our sample and each row is a gene.
- 03:26And already last time we saw that if you compare a bulk versus a single cell matrix, the single cell one has lower numbers, lower counts, and that means we have a higher potential contribution of noise, and also we have several zeros:
- 03:44like 60 to 80% of all the values will be 0.
- 03:49And the problem is that many of these zeros are not biologically true: it doesn't mean that the gene is not expressed in the cell; they are technical, because the transcripts were not detected during our RNA capturing approach.
- 04:02So that's the main difference in terms of numbers with respect to bulk RNA-seq.
- 04:12So the first steps that we cover are the preprocessing steps.
- 04:17After arriving at the digital count matrix, we try to remove cells that are potentially of low quality and also genes that are potentially irrelevant for our analysis.
- 04:33So the first step in the preprocessing is that we want to remove empty droplets or dying cells.
- 04:43It can happen that during the preparation of our libraries some droplets are empty or are filled with cells that are dying.
- 04:55So usually a way to spot these is the quality of the data.
- 05:01What we can do is count the number of reads or the number of UMIs that we detect in each cell, that is, the sum of the number of unique reads that are aligned for each cell, and we can rank the cells from the one with the most UMIs to the one with the fewest, and we get this sort of distribution.
- 05:25Then we can decide to remove the bottom cells that you see here in red, the cells where the UMI number is very low.
- 05:37So this is an easy strategy to remove cells where we don't have coverage of many genes, we don't have a lot of reads, and likely it's a sign of something wrong during the preparation.
- 05:50For example, the droplet was empty or the cell was dying.
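A minimal sketch of this ranking-and-thresholding step (the counts matrix and the cutoff are hypothetical; in practice the threshold is read off the knee of the ranked curve, and dedicated tools such as EmptyDrops use more refined tests):

```python
import numpy as np

# Toy UMI counts (genes x barcodes); real data would be the digital
# count matrix produced after mapping.
rng = np.random.default_rng(0)
counts = rng.poisson(0.3, size=(2000, 500))

umis_per_cell = counts.sum(axis=0)             # total UMIs per barcode
ranked = np.sort(umis_per_cell)[::-1]          # the curve one would plot

min_umis = 100                                 # hypothetical cutoff from the knee
keep = umis_per_cell >= min_umis
filtered = counts[:, keep]
print(f"kept {keep.sum()} of {counts.shape[1]} barcodes")
```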
- 05:56Another way to remove dying cells: usually dying cells are associated with a high number of reads that map to mitochondrial genes, so dying cells have extensive mitochondrial contamination.
- 06:12And so one can quantify the number of reads that map to mitochondrial genes; I think there are 37 genes in human cells that are associated with the mitochondrial chromosome.
- 06:25And if the number of mitochondrial reads is less than 5%, then you keep the cell.
- 06:33If it's higher than that, 10 or 20%, then you remove the entire cell, because there is a high probability that this high contamination is due to the fact that the cell was dying.
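A sketch of that filter (toy matrix and gene names; human mitochondrial gene symbols are conventionally prefixed "MT-"):

```python
import numpy as np

# Toy genes x cells matrix with a couple of mitochondrial genes.
rng = np.random.default_rng(1)
gene_names = ["MT-CO1", "MT-ND1", "ACTB", "GAPDH", "CD3E"]
counts = rng.poisson(5, size=(len(gene_names), 300))

is_mito = np.array([g.startswith("MT-") for g in gene_names])
mito_frac = counts[is_mito].sum(axis=0) / counts.sum(axis=0)

max_mito = 0.05              # the 5% cutoff from the talk; 10-20% is also used
keep = mito_frac < max_mito
print(f"kept {keep.sum()} of {counts.shape[1]} cells")
```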
- 06:49Then, on the other side, we also want to remove doublets.
- 06:53The whole technique is based on the fact that we isolate single cells, but sometimes this doesn't happen properly.
- 07:02It means that it can happen that two cells share the same barcode, or two cells were not physically separated, so they were included in the same droplet, for example if we are using the droplet approach.
- 07:16And so we want to identify possible doublets and remove them. We define a doublet as a droplet, or as an isolation, not of one single cell but of two or more cells.
- 07:33The most common event is that you have two cells included in the same droplet.
- 07:39So when you develop these single cell techniques, there are experimental ways to evaluate the probability of having doublets, and the approach that is used is species mixing:
- 07:51you combine, for example, a population of human cells and a population of mouse cells, and then, when you map the reads from each cell, you see for how many cells you have a double mapping,
- 08:06so for how many cells some of your reads map to the human genome and some of your reads map to the mouse genome.
- 08:15You see here in this plot the mapping of the cells on the human transcripts and on the mouse transcripts.
- 08:24So all these cells contain only mouse RNA, and these here contain only human RNA.
- 08:31What you see here is the identification of doublets, because here the content is mixed: you have something from mouse, something from human, and this is likely because one mouse and one human cell were included in the same droplet.
- 08:47So the comparison of these two plots shows that the probability of having doublets obviously depends on the concentration of your cells at the beginning.
- 08:58That's why, for example, here at the lowest concentration, 12.5 cells, you have very few events, only one doublet event.
- 09:07When you increase the concentration of cells you probably increase the efficiency of sequencing your single cells, because you have fewer empty droplets, but you also increase the probability of having doublets, as you see here: the number increases.
- 09:23So obviously this evaluation is possible because you are mixing two species beforehand, but it's not always feasible in our experiments, so we need a way to predict the possibility that a cell was not really a single cell but a doublet.
- 09:43So there are computational approaches that try to evaluate, for each of the cells that we obtain, the possibility that it's not really a single cell but a doublet.
- 09:55So there are many procedures that are used, and a common approach is the in silico simulation of doublets.
- 10:06This means that you have your matrix with digital counts with your cells, and you simulate a doublet by selecting two random cells and combining them, meaning that for each pair of cells you calculate the hypothetical cell that contains the sum of the reads of the two cells.
- 10:30So this is an in silico doublet, and you generate thousands of these in silico doublets, and the procedure is to mix these doublets together with the real cells, so that they are analyzed together.
- 10:48So at some point of the analysis, which we will see later, cells are clustered together, and so for each of the original cells one can see how many in silico doublets are in the surroundings of the cell.
- 11:03So for each cell I can calculate how many neighbors are in the neighborhood: how many real cells there are and how many simulated doublets there are.
- 11:13And the principle is that the ratio between the simulated doublets and the real cells is a score that represents the probability of this cell being a doublet itself.
- 11:26So the principle is that if my cell is surrounded by in silico doublets, then it's likely a doublet; if the doublets are all far from my cell, then probably this cell is not a doublet.
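A compact sketch of this neighbor-ratio idea (toy counts matrix; real tools such as Scrublet or DoubletFinder work on normalized, dimensionality-reduced data and calibrate the score, so this only illustrates the principle):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(500, 200)).astype(float)  # toy cells x genes

# Simulate doublets: sum the counts of two randomly chosen cells.
n_sim = 1000
i = rng.integers(0, counts.shape[0], size=n_sim)
j = rng.integers(0, counts.shape[0], size=n_sim)
sim_doublets = counts[i] + counts[j]

# Pool real cells and simulated doublets, then score each real cell by the
# fraction of simulated doublets among its nearest neighbors.
pooled = np.vstack([counts, sim_doublets])
is_sim = np.r_[np.zeros(len(counts), dtype=bool), np.ones(n_sim, dtype=bool)]

nn = NearestNeighbors(n_neighbors=20).fit(pooled)
_, idx = nn.kneighbors(counts)               # neighbors of each real cell
doublet_score = is_sim[idx].mean(axis=1)     # high score = doublet-like
print("highest scores:", np.round(np.sort(doublet_score)[-5:], 2))
```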
- 11:44Was this step clear?
- 11:48Kind of, sort of. Somehow you teach it what a doublet looks like, and then it can find those things? Or you teach it what a doublet looks like, and it says OK, a certain percentage should be doublets?
- 12:01Yes, so you build the doublets taking two random cells.
- 12:05I got that part, I just don't understand how that helps you identify a real one.
- 12:11Yeah, so the idea is that real doublets will be surrounded by in silico doublets, while real cells will be far from the in silico doublets.
- 12:20OK, OK.
- 12:23I have a related, maybe a related question, 'cause the idea of a doublet is that you have genes from more than one cell that are being sequenced.
- 12:33We have this thing that happened, and I'm asking in general, 'cause I'm assuming it would be true for other people as well. When we did parathyroid, the cells that make parathyroid hormone have a humongous amount of PTH as their, you know, main transcript.
- 12:49The ones that were PTH-negative all had some PTH, nothing on that order: like, let's say we had 1000 for PTH, we'd have like three, or one or two, in the cells that should have been negative.
- 13:03And it's hard to believe that every cell in the parathyroid actually has some RNA for this parathyroid hormone in it. It's much more likely that the cell that looks like an endothelial cell really is an endothelial cell and those three little reads are wrong, but I don't know how that would have happened.
- 13:21Yeah, I don't know, it could also be like a contamination. Uhm?
- 13:28But if it's three instead of 3000, well, that's a good signal to noise ratio, I would say. I absolutely agree; I just thought there was maybe some general principle in single cell seq that we needed to look at, but that's not the case.
- 13:43No, the only thing coming to my mind is the possibility, yeah, there is some possibility of supernatant contamination, so that you get some RNA that is floating in the solution. It could be something like that, especially if the transcript is abundant.
- 13:59Thank you. Diane, maybe one other explanation for your finding is that those cells have some illegitimate transcription going on, and so, you know, that could be an explanation.
- 14:11Yes, absolutely, but that would be real. That would suggest that endothelial cells in the parathyroid like to turn on some parathyroid hormone, which would be a little weird.
- 14:21But the definition of illegitimate transcription is expression of any gene transcript in any cell type. I mean, that's fine, but you know.
- 14:30The thing is that the parathyroid tissue that was sequenced is not adenoma; it's normal, not abnormal, OK?
- 14:39Diane, were those cells, like, washed before they were put on the sequencer? 'Cause maybe somehow some transcripts are just leaking through, if there's a lot of them.
- 14:47They were, yeah. Yeah, that gets back to maybe the contamination, I don't know. I thought that the machine washed the cells, but I don't know specifically.
- 14:56I'm sure that the sample kind of goes through all this plumbing to get there, so it's a little surprising that would happen, but maybe.
- 15:04All right, moving on. I thought it was maybe something we all needed to know, but it seems to be a specific problem.
- 15:10Sorry, cells were definitely washed before they went on.
- 15:17OK, moving on, but thank you.
- 15:21Uh, OK, so the next step after this: these were to remove cells that we didn't want in the following analysis. The next step is the normalization.
- 15:33So the normalization, as in any experiment, has the aim of removing systematic differences in the quantification of genes between cells.
- 15:42We already saw the methods that are used for bulk RNA-seq.
- 15:48So the simplest approach is library size normalization: the signal from each cell is divided by the total sum of the number of reads or UMIs across all genes for that cell.
- 16:05So this is the simplest approach, normalization for the size of the library of each cell.
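A sketch of library size normalization (scaling to 10,000 counts per cell and then taking log1p is a common convention, not something prescribed in the talk):

```python
import numpy as np

rng = np.random.default_rng(2)
counts = rng.poisson(1.0, size=(1000, 300)).astype(float)  # toy genes x cells

# Divide each cell by its total count, rescale to a common target of
# 10,000 counts per cell, then log-transform.
size_factors = counts.sum(axis=0) / 1e4
normalized = np.log1p(counts / size_factors)
```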
- 16:11The questionable assumption of this approach is that you are assuming that each cell should have the same number of reads.
- 16:18This is problematic: it's already problematic in bulk RNA-seq to assume that all your samples should have approximately the same amount of RNA.
- 16:29It's even more questionable for single cell, because we know some cells, depending on the cell type, can have different numbers of RNA molecules, depending on their transcription activity.
- 16:47So the main alternative that is used to this simplest approach is to use spike-in RNAs.
- 16:56There are many kits of spike-in RNAs that are now available, and the assumption when normalizing with spike-ins is that inside each cell there is the same amount of spike-in RNA.
- 17:11And then the common suggestion in the field is that it's better to use single cell specific methods, and it's better not to use the methods that are commonly used in bulk RNA-seq normalization.
- 17:29The reason for this is that the bulk methods do not take into consideration the fact that most of the values are zeros, and so using bulk RNA-seq normalization methods could lead to very strange size factors.
- 17:47So all these single cell specific methods somehow take into consideration this problem of the excessive zeros, and they use different strategies to normalize.
- 17:57So there are many methods for single cell. Some of them consider, instead of single cells, pools of cells, so that they normalize not each single cell but groups of cells whose content is summed up, and this somehow reduces the number of zeros.
- 18:16And then another methodology tries to normalize differently for different groups of genes, depending on whether they have low, medium or high expression levels.
- 18:31Uh, so the key point here is that, as usual, the normalization choices affect the results. This is taken from a paper published last year that was comparing different normalization methods developed for single cell RNA-seq data.
- 18:49So here this is a simple data set of mouse embryonic stem cell data where you have two populations of cells, and the cells are colored according to the two populations they belong to.
- 19:07So what you see first is the result without normalization at all, and it seems to work quite fine even without normalizing.
- 19:15Then this simple normalization here is the library size normalization, and this also seems to be working quite fine, except for this cell here.
- 19:25And then you see six different methods that were developed only for single cell RNA-seq, and they are divided in two groups based on whether they require spike-in RNAs to work (these are BASiCS, GRM and SAMstrt) or do not require spike-in RNAs.
- 19:44So the general message here is that depending on the method, the separation of these populations changes a lot, and there is no method that works best for every data set.
- 20:00So I would say it's usually important to try different methods, and depending on whether you have spike-ins or not, the possibilities are either limited or not.
- 20:15This is another... yes? Sorry, go ahead.
- 20:18Yeah, it's actually quite interesting to see this result; you know, it seems the simple normalization is the best in this case, if simply judging from how tight the same cell population is and how far away two distinct populations are.
- 20:34But I would assume this is done by maybe something like a Euclidean distance based measurement, because if you simply normalize by library size, if you use a correlation, that wouldn't change anything, right? Because the correlation between the genes, or between cells, will still remain the same regardless of whether you normalize by library size or not.
- 20:56Yeah, one thing I didn't say: here the visualization of these clusters is based on an approach of dimensionality reduction that is called t-SNE, so that could also have an effect.
- 21:11So the differences that you see here change also if you change the dimensionality reduction method used to plot the results. But I agree, here the simple normalization seems to be one of the most effective, in terms of separating the two clusters at least.
- 21:33This is another example with another data set, of mouse lung epithelial cells. So here you have more clusters of cells corresponding to different differentiation points, so different stages of the embryo, E14, E16 and E18, and then the green are the adult epithelial cells.
- 21:57So also here, the basic take-home message is that there is no consensus on which method is best, and different methods can lead to different results.
- 22:16So depending on the data set, the method that performs best changes.
- 22:25And what you don't have here are the methods taken from bulk, so they were not considered in this comparison.
- 22:36OK, so this was for the preprocessing steps. Then there are the post-processing steps.
- 22:41After we have our normalized data, we can start the second part of the analysis, and the main step here is the dimensionality reduction.
- 22:52So we will see that these data, since they have a lot of rows and columns, have a high dimensionality.
- 23:00This is problematic for the interpretation, for the visualization, and also for running computational procedures, because it can take a lot of time.
- 23:15So a reduction to a medium dimensional space is usually performed on the genes, so that instead of the 10,000 genes that we have at this point, we have 10 to 30 dimensions, and we will see that these dimensions can represent combinations of different genes.
- 23:36But the key point is that you reduce the number of dimensions from 10,000 to around 10.
- 23:45So this is the first important step, and I will speak about it in quite some detail.
- 23:52So the problem is this curse of dimensionality: we have 10,000 to 20,000 genes as features, and depending on our experiment we have 10,000 up to 1,000,000 cells that we want to analyze and to consider.
- 24:07So we need to reduce the number of features, in particular the number of genes. The rationale has two points.
- 24:19The first is that not all the genes are important. If our aim is to classify cells according to their differences in expression, not all genes are important.
- 24:30So for example, genes that are never expressed are certainly not important, but also housekeeping genes, which are always expressed at the same level, are not useful for separating the cells, and we select genes for this point through feature (gene) selection.
- 24:50Then the second point is that many genes are correlated in expression, so it's redundant to keep two highly correlated genes as two separate pieces of information: we can combine them into one dimension.
- 25:07And this correlation is taken care of during the dimensionality reduction approaches.
- 25:12So for the first step, the selection of genes that are important, the aim is to select the genes that contain useful information about the biology of the system, so the genes that differ in expression between different cells.
- 25:30And we want to remove genes that contain either only noise, because they have low expression levels and so all their variation is noise, or genes that do not vary among cells, so the housekeeping genes.
- 25:44And the simplest approach to do that is to calculate for each gene a sort of measure that is the variance corrected for the mean.
- 25:56We have seen something similar also during the lesson on bulk RNA-seq, because the approach is not so different.
- 26:05So you rank genes, you build a model. Each dot here is a gene, and you expect the variance of the gene to be proportional to the average expression of the gene, meaning that the more the gene is expressed, the more random fluctuation you also expect.
- 26:26So you build a sort of model that captures the random variation that you expect in your genes, and then you see which genes are outliers, so they show more variance than the baseline variance that is based on the noise, or on the random variation in expression.
- 26:45And those genes that are highly variable are the ones that you select for further analysis, because those are the genes where you don't have only technical variation but also biological variation.
- 27:00The questionable assumption here is that the biological variability is higher than the technical variability, because the assumption is that all these outlier genes, which show higher variance than the average, are important because this higher variance is biological variance.
- 27:20And obviously, also here, as in some bulk RNA-seq approaches, you could have methods that penalize genes having high variance but low mean, because you don't trust them so much; but the assumption is that you calculate a measure of variance, you consider the top variable genes, and you remove the others from the analysis.
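A rough sketch of that outlier-above-the-trend idea (toy data; real implementations such as scran's or Seurat's fit more careful mean-variance models):

```python
import numpy as np

rng = np.random.default_rng(3)
expr = rng.gamma(2.0, 1.0, size=(2000, 300))   # toy genes x cells, normalized

gene_mean = expr.mean(axis=1)
gene_var = expr.var(axis=1)

# Fit a simple variance-vs-mean trend in log space; the trend plays the
# role of the "expected random variation" for a gene at that expression.
coeffs = np.polyfit(np.log1p(gene_mean), np.log1p(gene_var), deg=2)
expected_log_var = np.polyval(coeffs, np.log1p(gene_mean))

# Genes far above the trend are the highly variable genes we keep.
excess = np.log1p(gene_var) - expected_log_var
highly_variable = np.argsort(excess)[-500:]    # keep the top 500
```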
- 27:45Then there is the dimensionality reduction: this is a family of approaches that take complex data and reduce the number of dimensions of the data.
- 27:55So this has a double purpose. As I said, it helps the downstream analysis, because reducing the dimensions speeds up the calculation times, and it also helps the visualization.
- 28:11Especially when you report single cell data, the need is to show data in a simple and interpretable output, and usually this is a 2D plot.
- 28:22And so dimensionality reduction is also used to compress high dimensional information so that it can be presented in a 2D plot; and for these two different needs there are multiple methodologies, each one with different advantages and limitations.
- 28:39So the classic example of a dimensionality reduction that we always have in mind, and possibly, historically speaking, one of the oldest, is the problem of drawing a 2D map of the Earth.
- 28:55So Earth is 3D and you want a 2D map that keeps most of the reliable information on the geography of Earth: where the continents are placed, their shapes, their areas and so on.
- 29:11So there are many different approaches to convert the 3D surface of Earth into 2D maps, and for example here you see one of the most famous projections, which is called the Mercator projection.
- 29:26This was developed in the 16th century, and it's the one used by sailors because it keeps directions and shapes, so it's a good map to know what is north, east or west.
- 29:40The problem is that there is a high distortion of areas in this map, so that the farther you are from the equator, the larger areas seem than they really are.
- 29:51So for example here it seems that Greenland is bigger than the whole of South America, which is not so; this is the distortion.
- 29:59Other projections, such as these two, are projections where the area is preserved, so that the areas on the map really correspond to the relative areas on Earth, for example with respect to South America; but these kinds of maps do not preserve shapes and directions.
- 30:18So the common point is that any projection will distort some of the features.
- 30:26So reduction of dimensionality is always an approximation, and it brings some distortions and deviations. And as for the Earth map, we have different approaches also for our single cell data; as we will see, there are different techniques.
- 30:44Is this clear?
- 30:49Anyway, it's a very good analogy,
- 30:50so this is awesome feedback.
- 30:54So the first one that we will see with a real example is principal component analysis.
- 30:59In our case we are studying cells based on the expression of genes. In our simple example we will have six cells, and since they are simple cells, they express, or we measure, only four genes.
- 31:12And so what you see here is the expression level of each gene, from A to D, in these six cells.
- 31:20So now we can use the expression levels as a way to map cells, and the expression level of each gene is a different dimension.
- 31:31So in this case we have a four dimensional space, which obviously we cannot plot on a 2D plot.
- 31:39But we could plot, on a 2D plot, cells based on the expression of two genes, so we can take gene A and gene B
- 31:50and build this sort of map of the cells based on the expression level of gene A, which is our X axis, and gene B, which is our Y axis.
- 32:01And here you see where cells are located according to the expression of these two genes.
- 32:09So the expression of each gene is a dimension.
- 32:13So now, with this, we can plot two genes in a 2D map. And in performing principal component analysis, what is usually done at the beginning is to center the measurements.
- 32:26For example, gene A here has an average expression of seven: this is the average of gene A across the six cells; gene B has an average of 4.5, gene C of six, and so on.
- 32:44So centering the data means that you calculate the mean expression of the gene across all the cells and you subtract the mean from all the values of the gene, so that you switch from this matrix, which is not centered, to this matrix, which is centered around 0.
- 33:03So simply, from the top row I subtracted seven, so 11 minus 7 is 4, and so on; from the second row I subtracted 4.5, and so on. As you see, the centered values are also negative, and the common point is that the mean for each gene is 0.
- 33:23So usually, before performing dimensionality reduction, this centering is performed, and it's also helpful for the visualization.
- 33:32So before centering, the cells were looking like this; after the centering, these are the new coordinates. Nothing changed: it's only the origin of the axes and the position of the zero that are different. If you look at the cells, the points are exactly in the same position as before.
- 33:53Now, one question that we can ask here is the following: we can capture the difference between these cells because they differ in the expression of A and B, but is gene A or gene B better at separating these cells?
- 34:16This corresponds to asking how much of the variability of the data is associated with the expression of gene A or with the expression of gene B.
- 34:28And so the question is: what is the variation of these six points that is associated with gene A expression and with gene B expression?
- 34:39So there is a simple way to calculate the associated variation, which corresponds to the formula of the variance.
- 34:48This is an example of calculating the variation that is associated with gene A. Here we are considering the X axis, so I can draw a projection from each cell onto this axis and calculate the distance from the origin to each cell.
- 35:08And basically, since here we centered the data, the distance basically corresponds to the expression level: cell 1 has a distance of 4, cell 2 a distance of 5, and so on.
- 35:22Now, if we want to measure the variation with the variance formula, the variance formula says to take the square of each of these six distances, sum the squares, and then divide everything by the number of observations minus one.
- 35:40So this is how we calculate the variance of the expression of gene A: we take the six distances, we square them, we sum the results and we divide by 5. So the variance of gene A is 30.8.
- 36:00We can do the same for gene B in order to have the variance associated with gene B.
- 36:06Now, looking at this plot, it seems by eye that gene A has more differences, higher variance, than gene B, and you can see this also by looking at the ranges of the axes: minus 6 to 6 versus minus 4 to 4.
- 36:26So the variance of gene A is 30.8; the variance of gene B is less, in this case it is 8.3.
- 36:37The calculation of the variance of gene B is the same, but instead of projecting cells on the X axis, we project cells on the Y axis, and that's how I arrive at these results.
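In code, the variance computed here is just the sum of squared centered values divided by n minus 1. The six values below are hypothetical (the talk only gives the first two distances, 4 and 5), but they are chosen to reproduce the 30.8 from the example:

```python
import numpy as np

# Centered expression of "gene A" across six cells: hypothetical values,
# consistent with the distances (4, 5) and the variance (30.8) in the talk.
centered = np.array([4.0, 5.0, 6.0, -4.0, -5.0, -6.0])

n = len(centered)
variance = (centered ** 2).sum() / (n - 1)     # sum of squares / (n - 1)
print(variance)                                # 30.8
print(np.var(centered, ddof=1))                # same result via numpy
```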
- 36:50Now, if we consider the global variance of our data along these two dimensions, we can say that the expression of gene A contains 80% of the global variance, and the expression of gene B contains approximately 20% of the whole variance, where the whole variance is just 30.8 plus 8.3.
- 37:18So if now I have to select only one of these dimensions, based on the fact that variation is information, I would select gene A. If I have to drop one of the genes, I would drop gene B, because it contains less information, less variance, than gene A.
- 37:38Now, the question for PCA is whether there is a line, which is neither gene A nor gene B, neither of these two, that captures more variation, that maximizes the variation that is captured.
- 37:56So the idea is to calculate the variance that is associated with each of these possible lines, in the same way as we did here, but changing the line and so changing this calculation.
- 38:13So this is a problem of minimization of the distances, or maximization of the variance.
- 38:23And so we can find, among all the possibilities, the line that maximizes the variance for our data. In this case, this is that line, and basically what we found is principal component one of our data.
- 38:43So principal component one is exactly the dimension that maximizes the variance of the data with respect to all the other possible lines, in this case lines that cross the origin.
- 39:02Now, once we have identified PC one, the second principal component, PC two, is the line that is orthogonal to the first, and this is easy because we are in a case where we have only two dimensions.
- 39:16So the second principal component is simply the line that is orthogonal to the principal component one that we found.
- 39:24So once we identify these principal components, we can now represent our data not from the point of view of the expression of our original genes, but from the point of view of principal component one and principal component two.
- 39:41So this means that we are rotating the data, in this way, so that now our system of reference is not given by our original expression values but by PC1 and PC2.
- 39:57But the data are always the same: they didn't change their respective localization, we just rotated the data.
- 40:05Now, the advantage of doing this is that if we calculate the variance associated with PC one and PC two, we can see a difference with respect to our original dimensions.
- 40:22So we can see that PC one captures almost 100% of the variance of our data, while PC two captures much less.
- 40:33And this is exactly because PC one was selected as the line maximizing this value here.
- 40:43So here you see the difference between the variances along the original dimensions, gene A and gene B, and along the new principal components.
- 40:51So the advantage of the technique is that now, if I want to drop one of the dimensions, so if we want to pass from 2 dimensions to one dimension, if I select PC one I lose less than 5% of the information, while with the original gene A and gene B, if I chose gene A, I had to lose 20% of the information.
- 41:13In this way I can reduce the dimensions from 2 to one while keeping almost all of the information of the data. And this is the trick used by principal component analysis.
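A from-scratch sketch of exactly this procedure: center each gene, then find the variance-maximizing directions as eigenvectors of the covariance matrix (the toy data are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
# Toy cells x genes matrix with correlated "genes".
x = rng.normal(size=(6, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])

xc = x - x.mean(axis=0)                  # center each gene, as in the talk
cov = xc.T @ xc / (len(xc) - 1)          # gene-gene covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # PC1 = direction of maximum variance

explained = eigvals[order] / eigvals.sum()
scores = xc @ eigvecs[:, order]          # the rotated data: cells in PC space
print("fraction of variance per PC:", np.round(explained, 3))
```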
- 41:27Ah, so. This is a more complex example, an example with four dimensions: if you remember, our original table was with four genes, so we can do the same with four genes, with four dimensions.
- 41:40We can calculate the original variance associated with each of the original genes, so genes A, B, C and D, expressed as a percentage of the entire variance.
- 41:51And again, if I had to choose the two genes containing most of the variance, I would choose gene A and gene C; but still I would lose the 10% of the variance associated with gene B and the 20% associated with gene D.
- 42:06If instead I perform the principal component transformation, I find four principal components, built so that the first one maximizes the explained variance, the second is orthogonal to the first and maximizes the residual variance, and so on.
- 42:24So the advantage is that now, if I consider these first two components and I remove the other two, I only lose like 3 to 4% of the variance and I can keep more than 90%, while before I could keep only 70% of the variance.
- 42:44And if I consider only these two dimensions and I plot my data, my cells, I obtain this plot.
- 42:54So these are the original cells, based on the expression of these four genes, plotted on the first two principal components, where dimension one explains 74% of the variance and dimension two explains 23%. This corresponds to these values here.
- 43:14And this is the advantage of PCA. So again, the trick was to reduce the space from four to two dimensions while keeping most of the information.
- 43:24And so the new dimensions, PC one and PC two, are linear combinations of the old dimensions, and the advantage of PCA is that I can easily calculate how much the expression of each original gene matters in each of the newly found principal components, for example in a plot like this.
- 43:46And this is a plot that shows that principal component one captures a lot of the expression of genes A, B and C, while gene D is not very important in principal component one; principal component two is mainly capturing the expression of gene D.
- 44:10And in this example, an explanation of this is that, if you look at the original values, genes A, B and C are very highly correlated: they are highly expressed in the first three cells and have low expression in the fourth to the sixth cells,
- 44:27while gene D is a little bit different, because it is highly expressed in cells 4 and 5 and has low expression in cells 1, 2 and 3.
- 44:36So this means that gene D is not correlated with the expression of the other genes, and that's why, using PCA, I can capture the correlated expression of the first three genes with the first principal component, and the expression of gene D, which is different and not correlated with the others, with the second component, the second dimension.
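The same analysis with scikit-learn, on a hypothetical 6-cell by 4-gene table built to echo the structure just described (genes A, B, C move together; gene D follows its own pattern; the numbers are made up):

```python
import numpy as np
from sklearn.decomposition import PCA

# Rows are cells 1-6, columns are genes A, B, C, D (invented values).
x = np.array([
    [11.0, 6.0, 10.0, 1.0],
    [10.0, 5.0,  9.0, 9.0],
    [12.0, 6.0, 11.0, 5.0],
    [ 2.0, 3.0,  2.0, 9.0],
    [ 3.0, 2.0,  3.0, 1.0],
    [ 4.0, 3.0,  1.0, 5.0],
])

pca = PCA(n_components=2)            # centering is done internally
scores = pca.fit_transform(x)        # the six cells in the first two PCs
print(pca.explained_variance_ratio_) # variance captured by PC1 and PC2
print(pca.components_)               # loadings: weight of each gene per PC
```

The loadings printed by `components_` are what the loading plot in the talk visualizes: PC1 weighted toward A, B, C and PC2 toward D.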
- 45:03Obviously, in a real case scenario, if we start from 3000 genes, we start from 3000 dimensions.
- 45:13But if you look at PCA plots, you can always find the percentage of variance that is explained by each dimension, so you can see how much of the entire information of the data can be explained using only two dimensions, and how much you are missing.
- 45:40Now, PCA was worth explaining because it's still one of the most used techniques, also in single cell data analysis. But you don't see PCA often in the visualization of single cell data.
- 45:54And that's because principal component analysis, as I said, has the advantage of being highly interpretable, because from the components I can go back quite easily to the original genes, so I can establish which genes are important in each of the dimensions, and it is computationally efficient;
- 46:15but when I want to visualize single cell RNA-seq data, it's not very appealing to the eye.
- 46:25And the reason for this is, again, that the data in single cell are nonlinear: they have an excess of zeros, and so if you plot the first two principal components, often you don't have a clear separation of cells, and that's what you want to show, especially if you're generating a figure that is going to represent your data.
- 46:50So for this reason, mainly for the visualization, not for the analysis of the data, in the first years of single cell analysis the most employed approach was called t-SNE, that is, t-distributed stochastic neighbor embedding.
- 47:07So this approach is not linear like principal components; it is based on graph methods.
- 47:13I will not spend a lot of time explaining how it works, but basically it's a stochastic procedure, and being nonlinear means that it transforms the original data using nonlinear equations, and the advantage is that it's better at showing clusters of cells.
- 47:37It's able to retain the local structure of the data in low dimensions, where local structure means clusters of cells that are very similar to each other.
- 47:50The disadvantage is that it's a stochastic method, so each iteration can produce a different result; that's not true for PCA. It takes a long time to run, especially when you increase the number of cells, and it's considered to be bad at keeping the global structure of the data.
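A typical usage sketch with scikit-learn (reducing to around 30 dimensions with PCA first is a common convention; the input matrix is hypothetical):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(5)
x = rng.normal(size=(300, 2000))        # toy normalized cells x genes matrix

x_pca = PCA(n_components=30).fit_transform(x)    # pre-reduce with PCA
embedding = TSNE(n_components=2, perplexity=30,
                 random_state=0).fit_transform(x_pca)
# random_state pins the stochastic initialization: different seeds give
# different maps, which is exactly the reproducibility caveat discussed here.
```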
- 48:10And I have an example of t-SNE. This is a data set of, I think, bulk RNA-seq samples from different cancers, so each color here is a sample from a different cancer type; the same data were run twice with the t-SNE approach, and these are the two outputs, so you can see that something is conserved between the two runs.
- 48:38So the number of clusters, the size of the clusters, probably the assignment of each sample to each cluster, and also somehow the shape of the single clusters have been conserved.
- 48:53But if you look at the organization of the whole set of clusters, that is different.
- 49:00For example, this orange cluster here in run one is in the middle, and green is opposite to red, while here red and green are very near to each other.
- 49:13So for capturing, for visualizing the clusters, this method is good; but if I start interpreting the distances between different clusters, this method is no longer reliable, because depending on the initial random step of the analysis it can lead to different maps, and that's the main reason.
- 49:36Yeah, I'm sorry, yeah. I was just wondering: is there any value in, like, sort of running the program a bunch of times, right? Like just iteratively, and then taking the average of the distances? Is there some truth that emerges when you repeat it a whole bunch of times, or is it just not useful?
- 49:57Uhm, I don't think I can answer that personally, so I wouldn't know the answer to this question. I don't know if anyone tried.
- 50:10So there is a way to reproduce the analysis, performing a so-called pseudorandom analysis, meaning that the random analysis, when you run a program, is based on a seed, which is a random number, but it can be kept, it can be remembered across different iterations, and if you keep the seed constant you can reproduce results.
- 50:31But that's just a way to keep the program consistent if you run it on different machines, for example; I don't know if this has been done, maybe so, but I don't know what the result would be.
- 50:50So to run it a lot of times and try to capture a sort of stability in the distances. Sure, sure.
- 50:55So, in the field, the fact is that if, for example, you look at publications, you can date the analysis according to the method that was used to visualize.
- 51:11So if you look at a plot and it's t-SNE, probably the analysis is from before 2018, because in 2018 what is now the most used approach to visualize single cell data was presented, and that's the UMAP method.
- 51:30So if you see a plot that uses UMAP as the dimensionality reduction technique, it dates from 2018 to now, more or less.
- 51:42So, about UMAP, another question, also a common one, sort of. I've actually noticed that, for example with t-SNE, if you just have completely random data, so let's say I generate computer-generated, completely random data, supposedly it will be a fuzzy ball in the PCA plot, but if you do the t-SNE it will become some kind of pattern; you can start to see patterns emerging from that.
- 52:11I just wonder, for things like UMAP and other methods, is this also the same problem?
- 52:17Yeah, it's the same problem, because all these methods, like, try to maximize the separation of these objects, and the problem is that, both for t-SNE and also for UMAP, you have noise. So if your differences are mainly driven by noise, they basically create patterns from noise, yeah.
- 52:39And this is less of a problem with PCA. That's right, yeah.
- 52:45So this is not solved by UMAP. What seems to be solved by UMAP is mainly that it's faster: it's faster than t-SNE, so it can be applied in a reasonable time when the data set is very big in terms of number of cells.
- 53:01And also it seems to be better at preserving the global structure of the data, so that the distances between the different clusters are more reliable.
- 53:15I think there is actually a parameter where you can tune how much weight you give to the local structure or to the global structure, but it's generally considered to be more reliable on the global structure of the data, so it's considered to be like a good tradeoff between the PCA and the t-SNE approaches.
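A usage sketch, assuming the umap-learn package (the n_neighbors parameter is the local-versus-global knob just mentioned; the input matrix is hypothetical):

```python
import numpy as np
import umap  # the umap-learn package

rng = np.random.default_rng(6)
x_pca = rng.normal(size=(300, 30))       # toy PCA-reduced cells x PCs matrix

# Small n_neighbors emphasizes local structure; larger values give more
# weight to the global layout. random_state makes the run reproducible,
# though UMAP, like t-SNE, remains stochastic across seeds.
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1,
                      random_state=42).fit_transform(x_pca)
```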
- 53:39But, that said, both these methods have problems, for example in interpretability: with PCA it is easy to go back to the original genes; with t-SNE and UMAP it's very problematic.
- 53:54And also UMAP is random, so different runs give you slightly different results.
- 54:02Can I ask you another quick question? When you say the difference in time for, like, processing the data, what is the scale of that time? Are you saying, like, hours or days?
- 54:16Ah well, it depends on the number of cells: it could be that if you have 100 cells you don't notice the difference, but scaling up, adding data, it slows down a lot.
- 54:32For t-SNE I know that there are a lot of variants that have been working on the efficiency, so they are faster; but I guess it's also a problem of memory.
- 54:46So personally I never ran an analysis on a sample that was more than 20 or 30 thousand cells, and so personally I don't know how problematic it is to work with t-SNE on a large data set of 1,000,000 cells.
- 55:00But the point is that the more cells you have, the more you gain in time using UMAP against at least the basic t-SNE.
- 55:09Sure. OK, so this is only technical.
- 55:15So the key point also is that, depending on the dimensionality reduction choice you make, the results look different. This is the same data set, so the normalization is the same, the input data was the same; it's from a mouse brain.
- 55:33So here you see some populations that correspond to neurons, different types of neurons, and microglia and other cells, and you see the representation of this dataset using principal component analysis.
- 55:44So cells here are colored according to the cell type, and you can see the classes, but, for example, points within the same cluster are kind of spread around; that's why for the visualization of the clusters this is less clear than the other two methods.
- 56:05These two methods basically try to maximize the compactness of the data inside the cluster, using two different approaches.
- 56:19And then, again, for interpreting these data, it's more about just seeing how cells are similar to each other within a cluster, how they cluster separately, as opposed to, like, the distances between the clusters, being able to infer any relationship from that distance, right?
- 56:36Well, here you could also be interested in the distances. For sure if you use t-SNE, the distances between them, as you see here, tend to be more or less the same, so they're equally distributed; but here in UMAP you can have distances that have a range, so low to high distances. So here it can be informative.
- 57:01But basically, when you use this for the visualization, what you want to communicate is that you identified the clusters of different cells, and you want to be able to see where they are, in which relation they are, how many cells belong to each cluster, and so on.
- 57:21So usually these maps are annotated with the name of each cluster, based on marker genes, and so obviously these methods are better for the visualization, if you want to label your clusters.
- 57:33But PCA is still the common tool, one of the most common tools run in the downstream analysis, meaning that t-SNE and UMAP are really used mainly for the visualization of the data and nothing else, while PCA is the basis for the clustering, for the trajectory analysis and so on.
- 57:52So still, many of the pipelines are using PCA; it's just for the visualization that they use these alternative approaches.