
Analysis and Interpretation of Single-Cell Sequencing Data – Part 2

August 25, 2021
ID
6877

Transcript

  • 00:00Yeah.
  • 00:03OK, so today is the second part of our
  • 00:06travel through the analysis of
  • 00:09single-cell RNA-seq data processing.
  • 00:11Last time we started with defining how
  • 00:14single-cell RNA sequencing works and the
  • 00:18differences between different protocols,
  • 00:20for example, coverage of genes,
  • 00:22how they isolate cells and so on.
  • 00:25Today we deal more with
  • 00:28the real analysis of the data.
  • 00:31So last time we arrived
  • 00:33at the point where we saw
  • 00:36these starting steps:
  • 00:37we saw that, from the molecular point of view,
  • 00:41the strategy is to link the original
  • 00:44RNA molecule with an oligonucleotide
  • 00:46called the cell barcode that allows us to
  • 00:50identify the cell of origin of the RNA,
  • 00:52and then another important part is the UMI,
  • 00:56the unique molecular identifier, which is a
  • 00:58random nucleotide sequence that allows us to
  • 01:01correct for amplification biases,
  • 01:03so as to keep only those duplicate
  • 01:06reads that belong to
  • 01:08different molecules in our cells
  • 01:11and were not generated by PCR amplification.
  • 01:15So after these steps, and after the
  • 01:18mapping, which we also covered last time...
  • 01:21Sorry, a question? Yes.
  • 01:24How can you have the same UMI
  • 01:27in two different RNAs? Oh, I see.
  • 01:30If you have the same
  • 01:32UMI, you collapse the reads.
  • 01:34So if the read is the same,
  • 01:36the UMI is the same and the
  • 01:38cell barcode is the same,
  • 01:40you collapse the reads.
  • 01:41I see, I see,
  • 01:43the one I'm looking at... I see, so you could
  • 01:46have the same cell barcode, the same
  • 01:49UMI, but a different sequence because
  • 01:51you're in a different part of the same RNA.
  • 01:55Well, in theory that depends on the
  • 01:57protocol, because some of them cover
  • 01:58only the three-prime end, and so...
  • 02:00I'm looking at numbers five and six there.
  • 02:06Five and six of the reads, you mean, yeah?
  • 02:09Well, yes, in theory.
  • 02:11But in theory, yes.
  • 02:12So these would be a different
  • 02:15RNA, could be a different gene,
  • 02:17but randomly they have the same
  • 02:19UMI, yeah. So in theory it can happen.
  • 02:24It depends on the length of the UMI,
  • 02:27because they are randomly generated.
  • 02:28For example,
  • 02:29if they are 12 nucleotides long,
  • 02:32the probability to have two that are
  • 02:35identical is one over four to the twelfth,
  • 02:38so the longer they are, the lower
  • 02:40the probability of having two
  • 02:43UMIs with the same sequence.
  • 02:46OK, yeah.
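(A minimal sketch of what this collapsing and the collision probability look like in practice; the Python below is illustrative only, with made-up barcodes and gene names, not part of any specific pipeline.)

    # Reads sharing the same (cell barcode, UMI, gene) are counted once, as one molecule.
    reads = [
        ("AAACGG", "TGCATGCATGCA", "GeneA"),   # (cell barcode, UMI, gene)
        ("AAACGG", "TGCATGCATGCA", "GeneA"),   # PCR duplicate -> collapsed with the first
        ("AAACGG", "GGTTAACCGGTT", "GeneA"),   # different UMI -> a second molecule
    ]
    molecules = set(reads)                     # unique (barcode, UMI, gene) triples
    print(len(molecules))                      # 2 molecules instead of 3 reads

    # Probability that two independent random 12-nucleotide UMIs are identical:
    collision = 1 / 4 ** 12
    print(collision)                           # ~6e-08, so collisions are rare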
  • 02:49Uhm, OK, so UMIs are, in the abstract,
  • 02:51a strategy used to reduce amplification
  • 02:53biases, in order to correct for that.
  • 02:56And in single cell they are important
  • 02:58because of the
  • 03:00low material we start with,
  • 03:02that is, the RNA content
  • 03:04of a single cell, and also the elevated
  • 03:07number of amplification cycles that are
  • 03:09necessary in order to amplify the signal.
  • 03:12So after the mapping of the reads we
  • 03:14arrive at this gene expression matrix
  • 03:17where each column represents
  • 03:19one of the cells of our sample and each
  • 03:22row is a gene, and already last time we
  • 03:26saw the fact that if you compare a bulk
  • 03:30versus a single-cell matrix, the single-cell
  • 03:33one has lower numbers,
  • 03:35lower counts, and that means that we
  • 03:38have a higher potential contribution of
  • 03:40noise and we also have a lot of zeros.
  • 03:44So something like 60 to 80% of all the values
  • 03:48will be 0.
  • 03:49And the problem is that many of these
  • 03:52zeros are not biologically true,
  • 03:54so it doesn't mean that the gene
  • 03:57is not expressed in the cell,
  • 03:59but they are technical, because
  • 04:01the transcripts were not detected during
  • 04:03our RNA capture.
  • 04:05So that's the main difference in terms
  • 04:09of numbers with respect to bulk RNA-seq.
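(As a minimal illustration of this sparsity, assuming a small made-up count matrix in numpy; real single-cell matrices are of course much larger.)

    import numpy as np

    # Toy gene-by-cell count matrix (rows = genes, columns = cells)
    counts = np.array([[0, 3, 0, 0],
                       [5, 0, 0, 1],
                       [0, 0, 0, 0]])

    # Fraction of zero entries; in real single-cell data this is often 0.6-0.8
    zero_fraction = np.mean(counts == 0)
    print(zero_fraction)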
  • 04:12So the first steps, as we said, are
  • 04:15the preprocessing steps.
  • 04:17So after arriving at the digital count
  • 04:19matrix,
  • 04:21we try to remove cells
  • 04:24that are potentially of low
  • 04:27quality and also genes that are
  • 04:30potentially irrelevant for our analysis.
  • 04:33So the first step in the preprocessing
  • 04:36is that we want to remove
  • 04:39empty droplets or dying cells.
  • 04:43It could happen that during the
  • 04:45preparation of our libraries,
  • 04:47some cells,
  • 04:48some droplets, are empty or filled
  • 04:51with cells that are dying.
  • 04:55So usually a way to spot these is
  • 04:58a quality check of the data.
  • 05:01What we can do is count the
  • 05:04number of reads or the number of
  • 05:07UMIs that we detect in each cell,
  • 05:10that is, the sum of the number of
  • 05:12unique reads that are aligned for
  • 05:14each cell, and we can rank the
  • 05:17cells from the one with the most
  • 05:19UMIs to the one with the fewest UMIs,
  • 05:22and we get this sort of distribution.
  • 05:25And then we can decide to remove the
  • 05:28bottom cells that you see here in red,
  • 05:31the cells where the UMI
  • 05:34number is very low.
  • 05:37So this is one easy strategy to
  • 05:39remove cells where we don't
  • 05:40have coverage of many genes,
  • 05:42we don't have a lot of reads,
  • 05:44and likely it's
  • 05:47the result of something wrong
  • 05:49during the preparation.
  • 05:50For example,
  • 05:51the droplet was
  • 05:53empty or the cell was dying.
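(A minimal sketch of this per-cell UMI filtering, assuming a genes-by-cells numpy matrix and an arbitrary cutoff; real tools pick the threshold from the knee of the ranked curve rather than a fixed number.)

    import numpy as np

    counts = np.random.poisson(0.3, size=(1000, 200))   # toy genes x cells UMI matrix

    umi_per_cell = counts.sum(axis=0)                    # total UMIs detected in each cell
    ranked = np.sort(umi_per_cell)[::-1]                 # cells ranked from most to fewest UMIs

    threshold = 100                                      # illustrative cutoff, not a recommendation
    keep = umi_per_cell >= threshold                     # drop the bottom, low-UMI "cells"
    filtered = counts[:, keep]
    print(ranked[:5], filtered.shape)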
  • 05:56Another way to
  • 05:58remove dying cells is that usually
  • 06:01dying cells are associated with a
  • 06:04high number of reads that map to
  • 06:07mitochondrial genes, so dying
  • 06:09cells have extensive
  • 06:11mitochondrial contamination.
  • 06:12And so one can quantify the number of
  • 06:15reads that map to mitochondrial genes.
  • 06:17I think there are around 40 genes in human
  • 06:22cells that are associated
  • 06:24with the mitochondrial chromosome,
  • 06:25and if the number of
  • 06:29mitochondrial reads is less than 5%,
  • 06:31then you keep the cell.
  • 06:33If it's higher than that, 10 or 20%,
  • 06:36then you remove the entire cell because
  • 06:38there is a high probability that
  • 06:41this high contamination is due
  • 06:44to the fact that the cell was dying.
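(A minimal sketch of the mitochondrial filter, assuming human gene names where mitochondrial genes start with "MT-"; the toy numbers are invented.)

    import numpy as np

    gene_names = np.array(["MT-CO1", "MT-ND1", "ACTB", "GAPDH"])
    counts = np.array([[30,  2,  1],      # genes x cells, toy values
                       [25,  1,  0],
                       [100, 80, 90],
                       [90,  70, 85]])

    is_mito = np.char.startswith(gene_names, "MT-")            # mitochondrial genes by name
    mito_fraction = counts[is_mito].sum(axis=0) / counts.sum(axis=0)

    keep = mito_fraction < 0.05                                # keep cells with <5% mito reads
    print(mito_fraction.round(3), keep)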
  • 06:49Uhm, then, on the other side,
  • 06:51we also want to remove doublets.
  • 06:53So the whole technique is based on the
  • 06:56fact that we isolate single cells,
  • 06:58but sometimes this doesn't
  • 07:01happen properly,
  • 07:02which means that it can happen that
  • 07:05two cells share the same barcode or
  • 07:07two cells were not physically separated,
  • 07:10so they were included in the same droplet,
  • 07:13for example if we are using the droplet
  • 07:16approach, and so we want to identify
  • 07:19possible doublets and remove those.
  • 07:21So we define a doublet as
  • 07:26a droplet, or as an isolation,
  • 07:28not of one single cell,
  • 07:31but of two or more cells.
  • 07:33The most common event is that you have
  • 07:36two cells included in the same droplet.
  • 07:39So when you develop these
  • 07:41single-cell techniques,
  • 07:43there are experimental ways to
  • 07:45evaluate the probability of having
  • 07:47doublets, and the approach that is used
  • 07:50is species mixing,
  • 07:51so you combine, for example, a population
  • 07:54of human cells and mouse cells.
  • 07:57And then, when you map the reads from
  • 08:00each cell, you see how many,
  • 08:03for how many cells, you have a double mapping.
  • 08:06So for how many cells some of
  • 08:09your reads mapped to the human genome and
  • 08:12some of your reads mapped to
  • 08:14the mouse genome.
  • 08:15You see here in this plot the mapping of
  • 08:20the cells,
  • 08:20so on the human transcriptome
  • 08:22and on the mouse transcriptome.
  • 08:24So all these cells here contain
  • 08:27only mouse RNA,
  • 08:29here they contain only human RNA.
  • 08:31What you see here is the identification
  • 08:34of doublets,
  • 08:35because here the content is mixed,
  • 08:37you have something from mouse,
  • 08:39something from human,
  • 08:40and this is likely to be because
  • 08:43one mouse and one human cell
  • 08:45were included in the same droplet.
  • 08:47So the comparison of these two
  • 08:50plots is there to say that the
  • 08:52probability of having doublets
  • 08:54obviously depends on the concentration
  • 08:55of your cells at the beginning.
  • 08:58That's why, for example,
  • 09:00here, when you have 12.5 cells per
  • 09:03microliter, you have very few events,
  • 09:05only one doublet event.
  • 09:07When you increase the
  • 09:09concentration of cells you probably
  • 09:11increase the efficiency of sequencing
  • 09:12your single cells, because
  • 09:14you have fewer empty droplets,
  • 09:16but you also increase the
  • 09:18probability of having doublets,
  • 09:20as you see here.
  • 09:21So the number here increases.
  • 09:23So obviously this evaluation
  • 09:25is possible because
  • 09:27you are mixing two species beforehand,
  • 09:30but it's not always feasible
  • 09:32in our experiments,
  • 09:33so we need to have a way to
  • 09:35predict the possibility that a
  • 09:38cell was not really a single cell
  • 09:41but was a doublet,
  • 09:43so there are computational approaches
  • 09:45that try to evaluate, for each of the
  • 09:48cells that we obtain, the possibility that
  • 09:51it's not really a single cell
  • 09:52but a doublet.
  • 09:55So there are many,
  • 09:58many,
  • 09:58many procedures that are used,
  • 10:01and a common approach is this in
  • 10:04silico simulation of doublets.
  • 10:06This means that you have your
  • 10:10matrix with digital counts
  • 10:12for your cells.
  • 10:14You simulate a doublet by
  • 10:16selecting two random cells,
  • 10:18two random cells, and combining them,
  • 10:20meaning that for each of these two cells,
  • 10:24you calculate the hypothetical
  • 10:25cell that contains the sum of
  • 10:28the reads of the two cells.
  • 10:30So this is an in silico doublet,
  • 10:33so you generate thousands of these
  • 10:36in silico doublets, and the
  • 10:40procedure is to mix these doublets
  • 10:43together with the real cells
  • 10:45so that they are analyzed together.
  • 10:48So at some point of the
  • 10:50analysis, that we will see later,
  • 10:52cells can be clustered together,
  • 10:54and so for each of the original cells
  • 10:57one can see how many in silico doublets
  • 11:00are in the surroundings of the cell.
  • 11:03So for each cell I can calculate, among
  • 11:06the neighbors in its neighborhood,
  • 11:08how many real cells there are and how
  • 11:11many simulated doublets there are.
  • 11:13And the principle is that the ratio
  • 11:15between the simulated doublets and the
  • 11:18real cells is a score that represents
  • 11:21the possibility, the probability, of
  • 11:23this cell being a doublet itself.
  • 11:26So the principle is that if my cell
  • 11:29is surrounded by in silico doublets,
  • 11:31then it's likely a doublet.
  • 11:34If the
  • 11:37doublets are all far from my cell,
  • 11:39then probably this cell is not a doublet.
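(A minimal sketch of this in silico doublet idea, not any specific published tool: sum random pairs of real cells to make artificial doublets, embed everything together, and score each real cell by the fraction of simulated doublets among its nearest neighbors.)

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    counts = rng.poisson(1.0, size=(500, 2000))        # toy matrix: 500 cells x 2000 genes

    # 1. Simulate doublets by summing the counts of random pairs of real cells
    pairs = rng.integers(0, counts.shape[0], size=(500, 2))
    doublets = counts[pairs[:, 0]] + counts[pairs[:, 1]]

    # 2. Analyze real cells and simulated doublets together in a reduced space
    combined = np.vstack([counts, doublets])
    embedding = PCA(n_components=10).fit_transform(np.log1p(combined))

    # 3. For each real cell, the fraction of simulated doublets among its neighbors
    is_simulated = np.array([False] * len(counts) + [True] * len(doublets))
    nn = NearestNeighbors(n_neighbors=20).fit(embedding)
    _, idx = nn.kneighbors(embedding[:len(counts)])
    doublet_score = is_simulated[idx].mean(axis=1)      # high score -> likely a doublet
    print(doublet_score[:5])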
  • 11:44Was this step clear?
  • 11:48Kind of, sort of. Somehow
  • 11:49you teach it what a
  • 11:51doublet looks like,
  • 11:52and then it can find those things,
  • 11:55or you teach it what a doublet
  • 11:57looks like, and it says OK,
  • 11:59a certain percentage should be
  • 12:01doublets? Yes, so you build the doublets,
  • 12:03taking two random cells. I got that
  • 12:05part, I just don't understand how
  • 12:07that helps you identify a real one.
  • 12:11Yeah, so the idea is,
  • 12:13yeah, that real doublets will
  • 12:15be surrounded by in silico doublets
  • 12:18while real cells will be far from
  • 12:20the in silico doublets. OK, OK. I have
  • 12:23a related, maybe a related question,
  • 12:26'cause the idea of a doublet is
  • 12:28that you have genes from more than
  • 12:30one cell that are being sequenced.
  • 12:33We have this thing that happened,
  • 12:35and I'm asking in general,
  • 12:37'cause I'm assuming it would be true
  • 12:40for other people as well: when we
  • 12:42did parathyroid, the cells that make
  • 12:44parathyroid hormone have a humongous
  • 12:46amount of PTH as their, you know,
  • 12:49main transcript.
  • 12:49The ones that were negative for PTH
  • 12:52all had some PTH,
  • 12:53nothing on the same order; like,
  • 12:55let's say we had 1000 for PTH,
  • 12:57we'd have like three, or one
  • 12:59or two, in the cells
  • 13:01that should have been negative.
  • 13:03And it's hard to believe that
  • 13:05every cell in the parathyroid
  • 13:06actually has some RNA in it
  • 13:08for this parathyroid hormone;
  • 13:09it's much more likely that the cell
  • 13:12that looks like an endothelial
  • 13:14cell really is an endothelial
  • 13:15cell and those three little reads
  • 13:17were wrong,
  • 13:17but I don't know how that
  • 13:19would have happened.
  • 13:21Yeah, I don't know if that could
  • 13:24also be like a contamination. Uhm?
  • 13:28But if it's three instead of 3000, well,
  • 13:30that's a good signal-to-noise ratio,
  • 13:33I would say. I absolutely agree.
  • 13:35I just thought there was maybe some
  • 13:37general principle in single-cell
  • 13:38seq that we needed to look at,
  • 13:40but that's not the case.
  • 13:43No, the only thing coming to my
  • 13:44mind is the possibility of,
  • 13:46yeah, there is some
  • 13:49possibility of supernatant
  • 13:50contamination, so that you get some
  • 13:52RNA that is in the solution,
  • 13:54for example. It could be something,
  • 13:56especially if it's abundant,
  • 13:57so it could be.
  • 13:59Thank you. Diane, maybe one other
  • 14:02explanation for your finding is
  • 14:04that those cells have some
  • 14:06illegitimate transcription going on,
  • 14:07and so, you know, that could be an explanation.
  • 14:11Yes, absolutely, but that would be real.
  • 14:13That would suggest that endothelial
  • 14:15cells in the parathyroid like to
  • 14:17turn on some parathyroid hormone,
  • 14:19which would be a little weird.
  • 14:21But the definition of illegitimate
  • 14:23transcription is expression of any
  • 14:25gene transcribed in any cell type.
  • 14:26I mean, that's fine, but you know.
  • 14:30And that parathyroid tissue
  • 14:32that was sequenced is not adenoma,
  • 14:34it's normal, not abnormal, OK?
  • 14:39Diane, were those cells washed
  • 14:41before they were put on the sequencer?
  • 14:43'Cause maybe
  • 14:44maybe somehow some transcripts
  • 14:46are just leaking through,
  • 14:47if there's a lot of them. They were.
  • 14:49Yeah, yeah, that gets back to maybe
  • 14:51the contamination. I don't know.
  • 14:53I thought that the machine washed the cells,
  • 14:55but I don't know specifically.
  • 14:56I'm sure that the sample kind of goes
  • 14:58through all this plumbing to get there,
  • 15:00so it's a little surprising
  • 15:01that would happen, but maybe.
  • 15:04All right, moving on.
  • 15:05I thought it was maybe something
  • 15:07we all needed to know,
  • 15:08but it seems to be a specific problem.
  • 15:10Sorry, cells were definitely
  • 15:11washed before they went on.
  • 15:17OK, moving on but thank you.
  • 15:21Uh, OK, so the next step after this,
  • 15:24so these were to remove cells that we
  • 15:27didn't want in the following analysis,
  • 15:30the next step is the normalization.
  • 15:33So the normalization, as in any experiment,
  • 15:36has the aim of removing systematic
  • 15:39differences in the quantification
  • 15:41of genes between cells.
  • 15:42So we saw the methods that are
  • 15:45used for bulk RNA-seq.
  • 15:48So the simplest approach is the
  • 15:51library size normalization, where the
  • 15:53signal from
  • 15:56each cell is divided by the total sum
  • 15:59of the counts, of the number of reads
  • 16:02or UMIs, across all genes for that cell.
  • 16:05So this is the simplest approach:
  • 16:07normalization for the size
  • 16:08of the library of each cell.
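(A minimal sketch of library-size normalization on a genes-by-cells count matrix; scaling to the median library size and log-transforming are common but arbitrary choices here.)

    import numpy as np

    counts = np.random.poisson(0.5, size=(1000, 50))   # toy genes x cells matrix

    library_size = counts.sum(axis=0)                  # total counts (reads or UMIs) per cell
    size_factor = library_size / np.median(library_size)

    # Divide each cell's counts by its size factor, then log-transform
    normalized = np.log1p(counts / size_factor)
    print(normalized.shape)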
  • 16:11The questionable assumption of
  • 16:12this approach is that you're
  • 16:14assuming that each cell should
  • 16:16have the same number of reads.
  • 16:18This is problematic.
  • 16:19It's already problematic in bulk
  • 16:21RNA-seq to assume that
  • 16:24all your samples should have
  • 16:26approximately the same amount of RNA.
  • 16:29It's even more, uh,
  • 16:31questionable for single cell, because
  • 16:33we know some cells, depending on the
  • 16:36cell type, can have different numbers
  • 16:39of RNA molecules,
  • 16:41depending
  • 16:44on their transcriptional activity.
  • 16:47So the alternative,
  • 16:48the main alternative that is
  • 16:50used to this simplest approach,
  • 16:52is to use spike-in RNAs.
  • 16:56Uhm,
  • 16:56there are many
  • 16:58kits of spike-in RNAs that
  • 17:00are now available,
  • 17:01and the assumption when
  • 17:04normalizing with these spike-ins
  • 17:06is that inside each cell there is
  • 17:08the same amount of spike-in RNAs.
  • 17:11And then the suggestion,
  • 17:13the common suggestion, in this
  • 17:16field is that it's better to use
  • 17:19single-cell-specific methods and
  • 17:22it's better not to use the methods
  • 17:25that are commonly used in
  • 17:27bulk RNA-seq normalization.
  • 17:29The reason for this is that the bulk
  • 17:33methods do not take into consideration
  • 17:36the fact that most of the values are zeros,
  • 17:39and so using bulk RNA-seq normalization
  • 17:42methods could lead to
  • 17:45very strange size
  • 17:46factors.
  • 17:47So all these single-cell-specific methods
  • 17:50somehow take into consideration this
  • 17:52problem of the excessive zeros, and
  • 17:54they use different strategies to normalize.
  • 17:57So there are many methods
  • 17:59for single cell.
  • 18:00Some of those consider, instead
  • 18:02of single cells, pools of
  • 18:05cells, so that they normalize
  • 18:07not each single cell
  • 18:08but groups of cells
  • 18:11where the content is summed up,
  • 18:13and this somehow reduces the number of zeros.
  • 18:16And another methodology tries
  • 18:19to normalize differently
  • 18:21for different groups of genes
  • 18:23depending on whether they
  • 18:26have low,
  • 18:28medium or high expression levels.
  • 18:31Uh, so the key point here is that,
  • 18:34uh, as usual,
  • 18:36the normalization choices affect the results,
  • 18:38so this is taken from a paper
  • 18:41published last year that was comparing
  • 18:43different normalization methods
  • 18:46developed for single-cell RNA-seq data.
  • 18:49So here this is a simple data
  • 18:51set of mouse
  • 18:52embryonic data where you
  • 18:54have two populations of cells,
  • 18:57the embryonic stem cells and another population,
  • 19:01and the cells are colored according
  • 19:03to the two
  • 19:06populations they belong to.
  • 19:07So what you see here is the result
  • 19:10without normalization at all,
  • 19:11so it seems to work quite fine
  • 19:14even without normalizing at
  • 19:15all. This simple normalization here is
  • 19:18the library size normalization, and also
  • 19:21this seems to be working quite fine,
  • 19:23except for this cell here, and then
  • 19:25you see six different methods that
  • 19:28were developed only for single-cell
  • 19:30RNA-seq, and they are divided into
  • 19:32two groups based on the fact that
  • 19:36they require spike-in RNAs to work,
  • 19:38and these are BASiCS, GRM and SAMstrt, or
  • 19:41they do not require spike-in RNAs.
  • 19:44So the general message here is
  • 19:47that depending on the method,
  • 19:49the separation of these populations
  • 19:51changes a lot between different methods,
  • 19:54so there is no method that works
  • 19:58best for every data set.
  • 20:00So I would say it's usually important
  • 20:03to try different methods, and depending
  • 20:06on whether you have spike-ins or not,
  • 20:10the possibilities are either limited or not.
  • 20:15This is another... excellent, yes,
  • 20:16sorry, so, it's alright.
  • 20:18Yeah, it's actually quite
  • 20:19interesting to see this result,
  • 20:21you know, it seems the simple
  • 20:24normalization is the best in this case,
  • 20:26if simply just judging from how tight
  • 20:29the same cell population is and how far
  • 20:32away two distinct populations should be,
  • 20:34but I would assume this is done by maybe
  • 20:38something like a sort of Euclidean
  • 20:40distance based measurement, because
  • 20:42if you simply normalize by library size,
  • 20:44if you use a correlation,
  • 20:46that wouldn't change anything, right?
  • 20:47Because the correlation between the
  • 20:49genes will still remain the same,
  • 20:51or between cells will still remain
  • 20:52the same, regardless of whether you
  • 20:54normalize by library size or not.
  • 20:56Yeah, then, here, so, I didn't say it,
  • 20:59so this is an anticipation:
  • 21:00here the visualization
  • 21:02of these clusters is based on
  • 21:05the approach of dimensionality
  • 21:06reduction that is called t-SNE.
  • 21:08So that could also affect it.
  • 21:11So these differences that you see here
  • 21:13change also if you change the
  • 21:17dimensionality
  • 21:19reduction method
  • 21:22used to plot the results. But I agree,
  • 21:25here the simple normalization
  • 21:26seems to be one of the most
  • 21:29effective in terms of separating
  • 21:31the two clusters, at least.
  • 21:33This is another example with another
  • 21:35data set, of mouse lung epithelial cells.
  • 21:38So here you have more clusters of
  • 21:40cells corresponding to different
  • 21:45differentiation points, so
  • 21:47different stages of the embryo,
  • 21:49E14, E16 and E18, and
  • 21:52then the green are the adult
  • 21:55epithelial cells.
  • 21:57So also here, the basic
  • 22:01take-home message is that there is no
  • 22:05consensus on which method is best,
  • 22:08and different methods can
  • 22:10lead to different results,
  • 22:16so that, depending on the data set,
  • 22:19the method that performs best changes.
  • 22:25And what you don't have here are
  • 22:27methods that are taken from bulk RNA-seq,
  • 22:29so they were not considered
  • 22:31in this comparison here.
  • 22:36OK, so this was for the preprocessing steps.
  • 22:39Then there are the postprocessing steps.
  • 22:41After we have
  • 22:44our normalized data,
  • 22:45we can start the second part of
  • 22:48the analysis, and the main steps
  • 22:50here are the dimensionality reduction.
  • 22:52So we will see that these data, since
  • 22:55they have a lot of rows and columns,
  • 22:57have a high dimensionality.
  • 23:00This is problematic for the
  • 23:02interpretation,
  • 23:04for the visualization, and also for
  • 23:07running computational
  • 23:10procedures, because it can take
  • 23:13a lot of time,
  • 23:15so the reduction to a medium
  • 23:18dimensional space is usually
  • 23:20performed on the genes, so that
  • 23:23instead of having the 10,000 genes
  • 23:26that we have at this point we have
  • 23:2910 to 30 dimensions, and we will see that
  • 23:32these dimensions can represent
  • 23:34combinations of different genes.
  • 23:36But the key point is that you reduce the
  • 23:41number of dimensions from 10,000 to about 10.
  • 23:45So this is the first important step,
  • 23:48so I will speak about this in quite some detail.
  • 23:52So the problem is this curse of
  • 23:54dimensionality: we have 10,000
  • 23:57to 20,000 genes as features, and
  • 23:59depending on our experiment we have
  • 24:0210,000 up to 1,000,000 cells that
  • 24:05we want to analyze and to consider.
  • 24:07So we need to reduce the number of features,
  • 24:11in particular the number of genes.
  • 24:14The rationale
  • 24:14is that there are two points
  • 24:17to the rationale.
  • 24:19The first is that not all the
  • 24:21genes are important.
  • 24:23If our aim is to classify cells according
  • 24:25to their differences in expression,
  • 24:28not all genes are important.
  • 24:30So, for example, for sure genes that are
  • 24:33never expressed are not important,
  • 24:35but also housekeeping genes that are
  • 24:38always expressed at the same level
  • 24:41are not important in separating
  • 24:43the cells, and we select these
  • 24:45genes, for this first point, through
  • 24:48feature (gene) selection.
  • 24:50Then the second point is that many
  • 24:53genes are correlated in expression,
  • 24:55so it's redundant to have two
  • 24:57genes that are highly correlated
  • 24:59as two separate pieces of information:
  • 25:02we can combine them into one dimension.
  • 25:07And this correlation is taken care of during
  • 25:10the dimensionality reduction approaches.
  • 25:12So for this selection,
  • 25:14for the first step,
  • 25:15the selection of genes that are important,
  • 25:20the aim is to select the genes that
  • 25:22contain useful information about the
  • 25:24biology of the system, and so they
  • 25:26are the genes that have differences in
  • 25:29expression between different cells,
  • 25:30and we want to remove genes that
  • 25:33contain either only noise, because
  • 25:35they have low expression levels and
  • 25:37so all the variation is noise, or the
  • 25:40genes that do not have variation among cells,
  • 25:42so the housekeeping genes.
  • 25:44And the simplest approach to do
  • 25:47that is to calculate for each gene
  • 25:50a sort of measure that is a variance
  • 25:54corrected for the mean.
  • 25:56So we have seen something similar
  • 25:58also during the lesson on bulk RNA-seq,
  • 26:02because the approach is not so different.
  • 26:05So you rank genes: you build a model.
  • 26:09Each dot here is a gene,
  • 26:11and you expect the variance of the gene
  • 26:15to be proportional to the
  • 26:17average expression of the gene,
  • 26:19meaning that the more the gene
  • 26:21is expressed, the more random
  • 26:24fluctuation you also expect.
  • 26:26So you build a sort of model that
  • 26:29captures the random variation
  • 26:31that you expect in your genes,
  • 26:33and then you see which genes are outliers,
  • 26:36so they show more variance than the
  • 26:40baseline variance that is based on the
  • 26:43noise, or on the random variation
  • 26:45in expression, and those genes that
  • 26:48are highly variable are the ones that
  • 26:50you select for further analysis,
  • 26:52because those are the genes where you
  • 26:55don't have only technical variation
  • 26:57but you also have biological variation.
  • 27:00The questionable assumption here is
  • 27:02that the biological variability is
  • 27:05higher than the technical variability,
  • 27:07because the assumption here is that
  • 27:10all these outlier genes that show
  • 27:13higher variance than the average are
  • 27:15important because this higher variance
  • 27:18is biological variance, and obviously,
  • 27:20also here, as in some bulk RNA-seq approaches,
  • 27:24you could have some methods that
  • 27:26penalize genes having high variance
  • 27:29but low mean,
  • 27:30because you don't trust them so much,
  • 27:32but the assumption is that you
  • 27:34calculate a measure of variance and
  • 27:37you consider the top variable genes
  • 27:39and you remove the others from the analysis.
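(A minimal sketch of this mean-variance idea: fit a simple trend of variance against mean and keep the genes that exceed it most. Real tools fit more careful models; the polynomial fit and the cutoff of 500 genes below are arbitrary illustrative choices.)

    import numpy as np

    rng = np.random.default_rng(1)
    norm = rng.gamma(2.0, 1.0, size=(2000, 300))      # toy normalized genes x cells matrix

    gene_mean = norm.mean(axis=1)
    gene_var = norm.var(axis=1)

    # Fit a simple trend of variance as a function of mean (polynomial in log space)
    fit = np.poly1d(np.polyfit(np.log1p(gene_mean), np.log1p(gene_var), deg=2))
    expected_var = np.expm1(fit(np.log1p(gene_mean)))

    # Residual variance above the trend: rank genes and keep the top ones
    residual = gene_var - expected_var
    highly_variable = np.argsort(residual)[::-1][:500]   # indices of the top 500 variable genes
    print(highly_variable[:10])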
  • 27:45Then there is the dimensionality reduction,
  • 27:47so this is a family of approaches
  • 27:49that are used on complex data
  • 27:51to reduce the number of dimensions of
  • 27:55the data, and this has a double purpose,
  • 27:58as I said: to help the
  • 28:02downstream analysis, because
  • 28:04reducing the dimensions speeds up the
  • 28:06calculation times, and also to
  • 28:09help the visualization.
  • 28:11Especially when you report single-cell
  • 28:13data, the need is to show data
  • 28:17in a simple and interpretable output,
  • 28:19so usually this is a 2D plot.
  • 28:22And so dimensionality
  • 28:24reduction is also used in order to
  • 28:27compress high-dimensional information
  • 28:29so that it can be presented in a 2D
  • 28:32plot, and the two are different needs.
  • 28:35There are multiple methodologies.
  • 28:36Each one has different advantages
  • 28:38and limitations,
  • 28:39so the classic example of a dimensionality
  • 28:42reduction that we always have in mind,
  • 28:45and possibly, historically speaking,
  • 28:47one of the oldest, is when you have the
  • 28:51problem of drawing a 2D map of the Earth.
  • 28:55So Earth is 3D and you want a
  • 28:572D map that keeps most of the
  • 29:01reliable information on the geography,
  • 29:03on the geography of Earth:
  • 29:05where the continents are placed,
  • 29:08their shapes, their areas and so on.
  • 29:11So there are different approaches,
  • 29:13many different approaches, to convert
  • 29:15the 3D map of Earth into 2D maps,
  • 29:19and for example here you see one
  • 29:21of the most famous projections, which
  • 29:24is called the Mercator projection.
  • 29:26So this was developed in the 16th century,
  • 29:29and it's the one used by sailors because
  • 29:32it keeps the directions and shapes.
  • 29:36So it's a good map to know
  • 29:38what is north, east or west.
  • 29:40And the problem is that there is a high
  • 29:43distortion of areas in this map, so that
  • 29:45the farther you are from the equator,
  • 29:48the more areas seem
  • 29:50larger than they are.
  • 29:51And so, for example, here it seems
  • 29:53that Greenland is bigger than
  • 29:55the whole of South America,
  • 29:57which is not so. This is the distortion.
  • 29:59So there are other projections, such as these two,
  • 30:03and they are projections where
  • 30:05the area is preserved,
  • 30:07so that this area corresponds really
  • 30:09to the smaller area with respect,
  • 30:12for example, to South America.
  • 30:14But these kinds of maps do not
  • 30:16preserve shapes and directions,
  • 30:18so the common point is that any
  • 30:21projection will distort
  • 30:24some of the features.
  • 30:26So reduction of dimensionality is
  • 30:28always an approximation, and it
  • 30:31brings some distortions and deviations.
  • 30:33And as for the Earth map,
  • 30:35where we have different approaches, also
  • 30:37for our single-cell data we will see
  • 30:40there are different techniques.
  • 30:44Is this clear?
  • 30:49Anyway, it's a very good analogy,
  • 30:50so this is awesome feedback.
  • 30:54So the first one that we will see with a real
  • 30:57example is principal component analysis.
  • 30:59So in our case we are studying cells
  • 31:01based on the expression of genes.
  • 31:03So in our simple example
  • 31:05we will have six cells, and since
  • 31:07they are simple cells they
  • 31:09express only
  • 31:11four genes.
  • 31:12And so what you see here is the
  • 31:15expression level of each gene,
  • 31:17from A to D, in these six cells.
  • 31:20So now we can use the expression levels
  • 31:23as a way to map cells,
  • 31:26and the expression level of each
  • 31:29gene is a different dimension.
  • 31:31So in this case we have a four-
  • 31:35dimensional space that obviously
  • 31:36we cannot plot on a 2D plot.
  • 31:39So, one simple thing, we could plot
  • 31:41cells on a 2D plot based on
  • 31:44the expression of two genes,
  • 31:47and so we can take gene A and gene B
  • 31:50and build this sort of map of
  • 31:53these cells based on the expression
  • 31:55level of gene A, that is our X
  • 31:58axis, and gene B, that is our Y axis.
  • 32:01And here you see where cells are
  • 32:04located according to the expression
  • 32:06of these two genes.
  • 32:09So the expression of each gene
  • 32:11is a dimension.
  • 32:13So now, with this, we can plot
  • 32:15two genes in a 2D map,
  • 32:18and so, for performing
  • 32:20principal component analysis,
  • 32:21what is usually done at the beginning
  • 32:24is to center the measurements,
  • 32:26meaning that this gene here
  • 32:29has an average expression
  • 32:31of seven:
  • 32:33this is the average of gene A
  • 32:36across the six cells.
  • 32:38And so on,
  • 32:39so gene B has an average
  • 32:42of 4.5, gene C of six, and so on.
  • 32:44So centering the data means that
  • 32:47you calculate the mean expression
  • 32:49of the gene across all these cells
  • 32:51and you subtract the mean from all
  • 32:54the values of the gene, so that you
  • 32:57switch from this matrix that is
  • 32:59not centered to this matrix that
  • 33:02is centered around 0.
  • 33:03So, simply, from the top
  • 33:05row I subtracted seven,
  • 33:07so 11 minus 7 is 4, and so on.
  • 33:10From the second I subtract 4.5, and
  • 33:12so on, so that you see the centered
  • 33:16values are also negative, and the common
  • 33:18point is that the mean for each gene is 0.
  • 33:23So usually, before performing
  • 33:25dimensionality
  • 33:26reduction,
  • 33:27this centering is performed, and
  • 33:29it's also helpful in the visualization.
  • 33:32So before centering,
  • 33:33the cells were looking like this.
  • 33:35After the centering,
  • 33:37these are the new coordinates,
  • 33:38so nothing changed, it's only the
  • 33:41origin of the axes and the position of
  • 33:43the zero that are different.
  • 33:46But if you look at the cells,
  • 33:48the points are exactly in the
  • 33:51same position as before.
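(A minimal sketch of this centering step on a cells-by-genes matrix; the numbers are invented, not the table from the slides.)

    import numpy as np

    # Toy matrix: 6 cells (rows) x 4 genes (columns)
    X = np.array([[11.0, 6.0, 10.0, 3.0],
                  [10.0, 5.0,  9.0, 8.0],
                  [ 9.0, 7.0,  8.0, 2.0],
                  [ 5.0, 3.0,  4.0, 7.0],
                  [ 4.0, 4.0,  3.0, 6.0],
                  [ 3.0, 2.0,  2.0, 1.0]])

    centered = X - X.mean(axis=0)       # subtract each gene's mean across the cells
    print(centered.mean(axis=0))        # ~0 for every gene after centering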
  • 33:53Now, one question that we can
  • 33:55ask here is whether...
  • 33:57So what you see here is the
  • 34:00difference between the cells:
  • 34:02we can capture the difference between
  • 34:04these cells because they differ
  • 34:06in the expression of A and B, and one
  • 34:09question we can ask is whether
  • 34:11gene A or gene B
  • 34:14is better in separating these cells.
  • 34:16So this corresponds to asking
  • 34:18how much of the variability of
  • 34:21the data is associated with the
  • 34:23expression of gene A or with
  • 34:25the expression of gene B.
  • 34:28And so the question is,
  • 34:30what is the variation of these six
  • 34:33points that is associated with gene A
  • 34:37expression and gene B expression?
  • 34:39So there is a simple way to calculate
  • 34:42the variation associated, which
  • 34:44corresponds to the formula of the variance.
  • 34:48So this is an example of calculating the
  • 34:51variation that is associated with gene A.
  • 34:54So here we're considering the X axis,
  • 34:58so I can draw a projection from each
  • 35:01cell onto this axis and calculate the
  • 35:05distance from the origin to each cell.
  • 35:08And basically,
  • 35:09since here we centered the data,
  • 35:12the distance basically corresponds
  • 35:13to the expression level.
  • 35:15So cell one has a distance of four, cell
  • 35:18two a distance of five, and so on.
  • 35:22Now, if we want to measure the
  • 35:25variation with the variance formula,
  • 35:27the variance formula is to take
  • 35:29the square of each of these
  • 35:32six distances,
  • 35:34sum the squares and then divide everything
  • 35:37by the number of observations minus one.
  • 35:40So this is how we calculate the
  • 35:43variance of the expression of gene A.
  • 35:45So the formula here is the following:
  • 35:48we take the six distances,
  • 35:51we square the distances,
  • 35:52we sum the results and we divide by 5.
  • 35:56So the variance of gene A is 30.8.
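(In formula form, with d_i the centered expression of gene A in cell i, that is, its distance from the origin along the gene A axis, and n = 6 cells:)

    \mathrm{Var}(\text{gene A}) \;=\; \frac{\sum_{i=1}^{n} d_i^2}{n-1} \;=\; \frac{d_1^2 + d_2^2 + \dots + d_6^2}{5} \;=\; 30.8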
  • 36:00We can do the same for gene B,
  • 36:03in order to have the variance
  • 36:06associated with gene B. Now, looking
  • 36:09at this plot of A and B,
  • 36:11it seems by eye that gene A has
  • 36:14more differences,
  • 36:14higher variance, than gene B,
  • 36:16and you can see this also by
  • 36:19looking at the range of the axes:
  • 36:22minus 6 to 6 versus minus 4 to 4.
  • 36:26So the variance of gene A is 30.8.
  • 36:30In this case,
  • 36:31the variance of gene B is less; in this
  • 36:34case it's 8.3.
  • 36:37So the calculation of the
  • 36:39variance of gene B is the same,
  • 36:41but instead of projecting
  • 36:43cells on the X axis,
  • 36:44we project cells on the Y axis, and
  • 36:47that's how I come up with these results.
  • 36:50Now we can see that, if we consider
  • 36:52the global variance of our data
  • 36:54along these two dimensions,
  • 36:56we can say
  • 36:59that the expression of gene A
  • 37:02contains 80% of the global variance
  • 37:05and the expression of gene B
  • 37:08contains approximately 20% of the
  • 37:11whole variance, where the
  • 37:15whole variance is just 30.8 + 8.3.
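(Written out, the shares of the total variance are roughly:)

    \frac{30.8}{30.8 + 8.3} \approx 0.79 \qquad \frac{8.3}{30.8 + 8.3} \approx 0.21

which is where the approximately 80% and 20% come from.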
  • 37:18So if now I have to select only one
  • 37:21of these dimensions, based on the
  • 37:24fact that variation is information,
  • 37:26I would select gene A.
  • 37:27So if I have to drop one of the
  • 37:30genes I would drop gene B, because
  • 37:32it contains less information,
  • 37:35less variance, than gene A.
  • 37:38Now the question for PCA is
  • 37:41whether there is a
  • 37:44line that is neither gene A nor gene B,
  • 37:47not one of these, that captures
  • 37:51more variation, that maximizes
  • 37:53the variation that is captured.
  • 37:56So the question is to try to calculate
  • 38:00the variance that is associated
  • 38:02with each of these possible lines,
  • 38:05in the same way as we did here,
  • 38:08but changing the line and
  • 38:11so changing this calculation.
  • 38:13So this is a problem of minimization
  • 38:17of the distances, or maximization
  • 38:21of the variance.
  • 38:23And so we can find,
  • 38:26among all the possibilities,
  • 38:27the line that maximizes the
  • 38:29variance for our data.
  • 38:31In this case this is the line
  • 38:35that maximizes the variance,
  • 38:37and basically what we found is
  • 38:40principal component one of our data.
  • 38:43So principal
  • 38:45component one is exactly
  • 38:48the dimension that maximizes
  • 38:50the variance of the data with respect
  • 38:52to all the other possibilities,
  • 38:55to all the other possible lines,
  • 38:57in this case those that cross the origin.
  • 39:02Now, once we identify PC one, PC two,
  • 39:05the second principal component,
  • 39:06is the line that is orthogonal to
  • 39:09the first one, and this is easy
  • 39:11because we are in a case where
  • 39:13we have only two dimensions, so
  • 39:16the second principal component
  • 39:18is simply the line that is
  • 39:20orthogonal to the principal component
  • 39:23one that we found.
  • 39:24So once we identify these principal components,
  • 39:27now we can represent our data
  • 39:29not from the point of view of the
  • 39:32expression
  • 39:34of our original genes,
  • 39:35but from the point of view of
  • 39:38principal component one and
  • 39:40principal component two.
  • 39:41So this means that we are rotating the data
  • 39:45in this way,
  • 39:46so that now our
  • 39:49system of reference is not given by
  • 39:52our original expression but by PC1 and PC2.
  • 39:57But the data are always the same:
  • 40:00they didn't change their respective
  • 40:02localization, so we just rotated the data.
  • 40:05Now the advantage of doing this is
  • 40:08that if we calculate the variance
  • 40:12associated with PC one and PC two,
  • 40:15we can see a difference with respect
  • 40:19to our original dimensions.
  • 40:22So we can see that PC one captures almost 100%
  • 40:27of the variance of our data
  • 40:30while PC two captures much less.
  • 40:33And this is
  • 40:35exactly because PC one was
  • 40:38selected because it was maximizing
  • 40:40this value here.
  • 40:43So here you see the difference
  • 40:45between the variance with the
  • 40:47original dimensions, gene A and gene B,
  • 40:49and with the new principal components.
  • 40:51So the advantage of the technique is that
  • 40:53now, if I want to drop one of the dimensions,
  • 40:57so if we want to pass from two
  • 40:59dimensions to one dimension,
  • 41:01if I select PC one
  • 41:02I lose less than 5% of the information,
  • 41:05while with the original gene A and gene B,
  • 41:08if I chose gene A I had to lose
  • 41:1120% of the information. In this way
  • 41:13I reduce the dimensions:
  • 41:14I can reduce the dimensions from two
  • 41:17to one while keeping almost
  • 41:20all of the information of the data.
  • 41:22And this is the trick used by
  • 41:26principal component analysis.
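(A minimal sketch of this whole procedure on a small cells-by-genes matrix, using scikit-learn's PCA, which centers the data internally; the values are invented, not the table from the slides.)

    import numpy as np
    from sklearn.decomposition import PCA

    # Toy matrix: 6 cells (rows) x 4 genes (columns)
    X = np.array([[11.0, 6.0, 10.0, 3.0],
                  [10.0, 5.0,  9.0, 8.0],
                  [ 9.0, 7.0,  8.0, 2.0],
                  [ 5.0, 3.0,  4.0, 7.0],
                  [ 4.0, 4.0,  3.0, 6.0],
                  [ 3.0, 2.0,  2.0, 1.0]])

    pca = PCA(n_components=2)
    coords = pca.fit_transform(X)            # each cell's position on PC1 and PC2

    print(pca.explained_variance_ratio_)     # fraction of the total variance per component
    print(pca.components_)                   # loadings: weight of each original gene in each PC
    print(coords)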
  • 41:27Ah, so,
  • 41:28this is a more complex example,
  • 41:31an example with four dimensions.
  • 41:33If you remember,
  • 41:35our original table was with four genes,
  • 41:37so we can do the same with four genes,
  • 41:40with four dimensions. We can calculate
  • 41:43the original variance associated
  • 41:44with each of the original genes,
  • 41:46so genes A, B, C and D, expressed as
  • 41:48a percentage of the entire variance.
  • 41:51And again,
  • 41:51if I had to choose the two genes
  • 41:54containing most of the variance,
  • 41:56I would choose gene A and gene C.
  • 41:58But still I would lose
  • 42:00the 10% of the variance associated with
  • 42:03gene B and the 20% associated with gene D.
  • 42:06Instead, if I perform the principal
  • 42:08component transformation,
  • 42:09I find four principal
  • 42:11components, in a way that the first
  • 42:14one maximizes the explained variance,
  • 42:17the second is orthogonal to the first
  • 42:20and maximizes the
  • 42:21residual variance, and so on.
  • 42:24So the advantage is that now, if I consider
  • 42:27these two components and I remove
  • 42:30these two, I only lose like 3 to 4%
  • 42:34of the variance and I can keep more than 90%,
  • 42:39while here I could keep
  • 42:41only 70% of the variance.
  • 42:44And if I consider only these two
  • 42:47dimensions and I plot my data,
  • 42:50my cells, here, I can obtain this plot.
  • 42:54So these are the original cells,
  • 42:56based on the expression of these
  • 42:59four genes, plotted in the first two
  • 43:02principal components, where dimension
  • 43:04one explains 74% of the variance
  • 43:07and dimension two explains 23%.
  • 43:09This corresponds to these values here.
  • 43:14And the advantages of PCA. So,
  • 43:16again,
  • 43:16the trick was to reduce the space
  • 43:19from four to two dimensions,
  • 43:21but keeping most of the information.
  • 43:24And so the new dimensions,
  • 43:26PC one and PC two, are
  • 43:29linear combinations of
  • 43:31the old dimensions, and the advantage
  • 43:33of PCA is that I can easily calculate
  • 43:36how much the expression of the original
  • 43:39genes is important in each of the
  • 43:41newly found principal components,
  • 43:43for example in a plot like this.
  • 43:46And this is a plot that shows
  • 43:50that principal component one captures
  • 43:52a lot of the expression of genes A,
  • 43:55B and C, while gene D is not very
  • 43:59important in principal component
  • 44:01one, while principal component
  • 44:04two is mainly capturing
  • 44:06the expression of gene D.
  • 44:10And in this example,
  • 44:11an explanation of this is, if you look
  • 44:14at the original values, that genes
  • 44:16A, B and C are very highly correlated,
  • 44:18so they're highly expressed in
  • 44:20the first three cells and
  • 44:22they have low expression in
  • 44:25the fourth to the sixth cells,
  • 44:27while gene D is a little bit
  • 44:29different, because the gene is highly
  • 44:31expressed in cells two, four and five and
  • 44:34has low expression in one and three.
  • 44:36So this means that gene D is not correlated
  • 44:40with the expression of the other genes,
  • 44:42so that's why, using PCA, I can capture
  • 44:45the correlated expression of these three
  • 44:47genes in the first principal component
  • 44:50and the
  • 44:52expression of gene D, that is
  • 44:56different and not correlated with the others,
  • 44:59using the second component.
  • 45:01The second dimension.
  • 45:03Obviously, in the real-case scenario,
  • 45:06if we start
  • 45:07from 3000 genes,
  • 45:10we start from 3000 dimensions.
  • 45:13But if you look at PCA
  • 45:16plots, sometimes you can,
  • 45:19you can always find also
  • 45:21the percentage of variance
  • 45:23that is explained by each dimension,
  • 45:26so you can see how much of the entire
  • 45:30information of the data can be
  • 45:33explained using only two dimensions
  • 45:35and how much you are missing.
  • 45:40Uh, now, PCA, uh,
  • 45:42it was worth explaining because it's
  • 45:44still one of the most used techniques,
  • 45:47also in single-cell data analysis.
  • 45:49But you don't see PCA often in the
  • 45:52visualization of single-cell data.
  • 45:54And that's because principal
  • 45:56component analysis, as I said,
  • 45:58has the advantage of being highly
  • 46:01interpretable, because from the
  • 46:03components I can go back quite
  • 46:05easily to the original genes,
  • 46:07so I can establish which genes
  • 46:11are important in each of the dimensions.
  • 46:13It is computationally efficient,
  • 46:15but when I want to visualize
  • 46:18single-cell RNA-seq data,
  • 46:22it's not very
  • 46:24appealing to the eye,
  • 46:25and the reason for this is, again,
  • 46:28that the data in single cell are nonlinear,
  • 46:31they have an excess of zeros,
  • 46:33and so if you plot the
  • 46:35first two principal components,
  • 46:37often you don't have a clear
  • 46:39separation of cells,
  • 46:40and that's what you want to show,
  • 46:43especially if you're generating a
  • 46:47figure that is going to represent your data.
  • 46:50So for this reason,
  • 46:52mainly for the visualization,
  • 46:53not for the analysis of the data,
  • 46:56in the first years of single-cell
  • 46:59analysis the most employed
  • 47:02approach was called t-SNE,
  • 47:04which is t-distributed stochastic
  • 47:06neighbor embedding.
  • 47:07So this approach is not linear
  • 47:10as principal component analysis is;
  • 47:12it is based on graph methods.
  • 47:13So on this I will not spend
  • 47:17a lot of time explaining how it works,
  • 47:20but basically it's a random procedure, and
  • 47:24being nonlinear means that it transforms
  • 47:28the original data using
  • 47:30nonlinear equations, and
  • 47:32the advantage is that it's better
  • 47:34at showing clusters of cells,
  • 47:36so that's it:
  • 47:37it's able to retain the local
  • 47:40structure of the data in low dimensions,
  • 47:43where local
  • 47:45structure means clusters of cells
  • 47:47that are very similar to each other.
  • 47:50The disadvantage is that it's a
  • 47:53stochastic method, so that each iteration
  • 47:56can produce a different result;
  • 47:58that's not true for PCA.
  • 47:59It takes a long time to run,
  • 48:01especially when you increase the number
  • 48:03of cells, and it's considered to be bad
  • 48:06at keeping the global structure of the data,
  • 48:10and I have an example of t-SNE,
  • 48:13so this is a data set with,
  • 48:16I think, bulk RNA-seq samples
  • 48:18from different cancers.
  • 48:20So each color here is a sample
  • 48:23from a different cancer type, and
  • 48:26then the same data were run twice
  • 48:29with the t-SNE approach and these
  • 48:31are the two outputs, so you can see
  • 48:35that something is conserved
  • 48:37between the two runs.
  • 48:38Uh, so the number of clusters,
  • 48:42the size of the clusters, and probably
  • 48:45the assignment of each sample to
  • 48:47each cluster have been conserved,
  • 48:49and also the shape of each
  • 48:51single cluster is somehow close.
  • 48:53But if you look at the organization of
  • 48:56the whole set of clusters,
  • 48:59that is different.
  • 49:00For example,
  • 49:00this orange cluster here in
  • 49:02run one is in the
  • 49:04middle, while here... and green
  • 49:07is opposite to red, while here
  • 49:09red and green are very near to each other.
  • 49:13So for capturing the clusters,
  • 49:16visualizing the clusters,
  • 49:17this method is good,
  • 49:19but then, if I start interpreting
  • 49:21the distance between different clusters,
  • 49:24this method is not valuable any more,
  • 49:26it's not reliable, because, depending
  • 49:29on the initial random step of the analysis,
  • 49:32it could lead to different maps,
  • 49:34and that's the main reason.
  • 49:36Yeah, I'm sorry, yeah.
  • 49:39I was just wondering so like is there
  • 49:41any value in like sort of running the
  • 49:44program like a bunch of times, right?
  • 49:46Like just iteratively and then
  • 49:47taking the average of the distances?
  • 49:49There's some truth that emerges there
  • 49:51when you like repeat it a whole bunch
  • 49:53of times or it's just not useful.
  • 49:57Uhm, I don't think I can answer
  • 50:00that personally, so I wouldn't
  • 50:03know the answer to this question.
  • 50:07Uhm, I don't know if anyone tried.
  • 50:10So, there is a way to reproduce
  • 50:12the analysis, performing a so-
  • 50:14called pseudorandom analysis,
  • 50:16meaning that the random analysis,
  • 50:17when you run a program, is based on
  • 50:20a seed that is a random number,
  • 50:22but it can be kept,
  • 50:24it can be remembered, during the different
  • 50:26iterations, and if you keep the seed
  • 50:29constant you can reproduce results.
  • 50:31But that's not...
  • 50:32So that's a way to keep
  • 50:34the program consistent if you run
  • 50:36the program on
  • 50:38different machines, for example,
  • 50:40but I don't know if that,
  • 50:45if this has been done,
  • 50:46maybe so, but I don't know what
  • 50:48would be the result of that,
  • 50:50so to run it a lot of times
  • 50:52and trying to capture a sort
  • 50:53of stability in the distances.
  • 50:55Sure, sure. So, in the field,
  • 50:59there is the fact that, for example,
  • 51:02if you look at the publications you can,
  • 51:05you can like date the analysis according
  • 51:08to the method that they used to visualize.
  • 51:11So if you look at some plot and
  • 51:13it's a t-SNE, probably the analysis
  • 51:16is from before 2018, because in
  • 51:192018 what is now the
  • 51:22most used approach to visualize
  • 51:24single-cell data was
  • 51:27presented, and that's UMAP,
  • 51:29the UMAP method.
  • 51:30So if you see a plot that uses
  • 51:34UMAP as a dimensionality reduction
  • 51:36technique, it dates from
  • 51:382018 to now, more or less.
  • 51:42So, about UMAP there is another
  • 51:44question that is also common, sort of, uh,
  • 51:47so I've actually noticed that if you,
  • 51:50for example with t-SNE,
  • 51:51if you just have completely random data,
  • 51:54so let's say, OK, I generate
  • 51:58computer-generated, completely random data,
  • 52:01supposedly it will be a fuzzy
  • 52:03ball in the PC plot,
  • 52:04but if you do the t-SNE it will
  • 52:07become some kind of patterns you can
  • 52:09start to see emerging from that.
  • 52:11I just wonder, for things like
  • 52:13UMAP and other things, is
  • 52:15this also the same problem? Or...
  • 52:17Yeah, it's the same problem,
  • 52:20because all
  • 52:22these methods like try to maximize
  • 52:25the separation of these objects, and
  • 52:27the problem is that both for t-SNE
  • 52:30and also for UMAP you have noise.
  • 52:33So if your differences are mainly driven
  • 52:35by noise, um, they
  • 52:37basically create patterns from noise,
  • 52:39yeah. And this is less of a problem
  • 52:43with PCA. That's right, yeah.
  • 52:45So this is not solved by UMAP.
  • 52:48What seems to be solved by
  • 52:50UMAP is mainly that it's faster,
  • 52:52uh, it's faster than t-SNE,
  • 52:55so it can be applied in a reasonable time
  • 52:58when the data set is very high in
  • 53:01terms of number of cells, and also it
  • 53:04seems to be better at preserving
  • 53:07the global structure of the data, so
  • 53:10that also the distances between the
  • 53:12different clusters are more reliable.
  • 53:15I think there is actually a parameter
  • 53:17where you kind of tune how much
  • 53:21weight you give to the
  • 53:23local structure or to the global structure,
  • 53:25but it's generally considered to be more
  • 53:28reliable on the global structure of the data,
  • 53:31so it's considered to be like a
  • 53:33trade-off, a good trade-off, between
  • 53:36the PCA and the t-SNE approaches.
  • 53:39But
  • 53:40both these methods have problems,
  • 53:42for example in interpretability:
  • 53:44with PCA it is easy to go back to the
  • 53:48original genes, with t-SNE and
  • 53:50UMAP it's very problematic.
  • 53:53And uhm,
  • 53:53yeah.
  • 53:54And also UMAP is random,
  • 53:56so different runs
  • 53:58give you slightly different results.
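(A minimal sketch of fixing the random seed so that a t-SNE run, and a UMAP run if the umap-learn package is installed, is reproducible; the data here are random and purely illustrative.)

    import numpy as np
    from sklearn.manifold import TSNE

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 50))                 # e.g., 300 cells x 50 principal components

    # Fixing random_state makes repeated runs give the same embedding on the same machine
    emb1 = TSNE(n_components=2, random_state=42).fit_transform(X)
    emb2 = TSNE(n_components=2, random_state=42).fit_transform(X)
    print(np.allclose(emb1, emb2))                 # True: same seed, same result

    # With umap-learn installed, the same idea applies:
    # import umap
    # emb = umap.UMAP(random_state=42).fit_transform(X)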
  • 54:02Can I ask you another quick question?
  • 54:05So when you say the difference in
  • 54:07time for like processing the data,
  • 54:10what is the scale of that time?
  • 54:12Are you saying like hours or days?
  • 54:16Ah, well, it's,
  • 54:18uh, well, it varies
  • 54:20depending on the number of,
  • 54:21uh, cells, so it could be
  • 54:24that if you have 100 cells
  • 54:26you don't notice the difference,
  • 54:28but scaling up, so adding data,
  • 54:32it degrades a lot. So, for t-SNE,
  • 54:34I know that there are a lot of
  • 54:36variations of t-SNE that have
  • 54:38been working on the efficiency,
  • 54:39so they are faster. Uh, but, uh,
  • 54:42I guess it's also a problem of memory,
  • 54:46so personally I never ran an analysis
  • 54:48on a sample that was more than
  • 54:5020 or 30 thousand cells, and so
  • 54:53personally I don't know how problematic
  • 54:55it is to work with t-SNE with a
  • 54:58large data set of 1,000,000 cells.
  • 55:00But the point is that the
  • 55:02more cells you have,
  • 55:03the more you gain in time
  • 55:05using UMAP against at least basic t-SNE.
  • 55:09Sure. OK, so this is only technical,
  • 55:13it's not really...
  • 55:14So the key point also is that, depending
  • 55:18on the dimensionality reduction choice you make,
  • 55:22the results look different.
  • 55:24So this is the same data set,
  • 55:27so the normalization is the same,
  • 55:29the input data was the same,
  • 55:31it's from a mouse brain.
  • 55:33So here you see some populations
  • 55:35that correspond to neurons,
  • 55:37different types of neurons, and microglia,
  • 55:39and other cell types.
  • 55:40And you see the representation of this
  • 55:42dataset using principal component analysis.
  • 55:44So cells here are colored according
  • 55:47to the cell type, and you
  • 55:50see that you can see the classes,
  • 55:52but, for example,
  • 55:54points within the same clusters
  • 55:56kind of spread around,
  • 55:58so that's why, for the visualization
  • 56:00of the clusters, this is less
  • 56:03clear than the other two methods.
  • 56:05So these two methods basically try to
  • 56:08maximize the compactness of
  • 56:10the data inside the clusters,
  • 56:12using two different approaches.
  • 56:19And then again, for interpreting these data,
  • 56:22it's more about just seeing
  • 56:23like how cells are similar to
  • 56:25each other within a cluster,
  • 56:27like how they cluster separately,
  • 56:29as opposed to like the distances
  • 56:31between the clusters, being able
  • 56:32to infer any relationship from
  • 56:34that distance, right? Well,
  • 56:36here you could be interested also... well,
  • 56:38for sure, if you use t-SNE,
  • 56:40the distances between them, as you see here,
  • 56:43the distances tend to be like
  • 56:46more or less the same,
  • 56:48so they're equally distributed, but
  • 56:50here in UMAP you can have
  • 56:54distances that have a
  • 56:56range, so low to high distances.
  • 56:59So here it can be informative.
  • 57:01But basically, when you use
  • 57:03this for the visualization,
  • 57:05what you want to communicate is
  • 57:08that you identified the clusters of
  • 57:11different cells, and you want to be
  • 57:14able to see where they are,
  • 57:17in which relation they are,
  • 57:18how many cells belong to each cluster,
  • 57:20and so on.
  • 57:21So usually these are then annotated
  • 57:23with the name of the cluster based
  • 57:26on marker genes, and so obviously
  • 57:28these, for the visualization, are
  • 57:29better if you want to label
  • 57:32your clusters and so on.
  • 57:33But PCA is still the common tool,
  • 57:36one of the most common tools that
  • 57:38are run in the downstream analysis,
  • 57:40meaning that t-SNE and UMAP are
  • 57:42really used mainly for visualization
  • 57:43of the data but nothing else,
  • 57:45while PCA
  • 57:47is the basis for the clustering and
  • 57:50for the trajectory analysis and so on.
  • 57:52So still, all the pipelines, many of the
  • 57:55pipelines, are still using PCA;
  • 57:58it's just for the visualization that
  • 58:00they use these alternative approaches.