Analysis and Interpretation of single cells sequencing data – part 3 Key Analysis Steps
August 25, 2021
Information
- ID
- 6878
- To Cite
- DCA Citation Guide
Transcript
- 00:05So today we will finish our
- 00:08session on single-cell data analysis.
- 00:12So far we have arrived at this step in
- 00:16the analysis: after all the quality
- 00:20control and the normalization,
- 00:22a necessary step to reduce the complexity
- 00:25of the data set is the reduction
- 00:29of dimensionality,
- 00:31so the reduction of features.
- 00:33There are two possible ways:
- 00:35feature selection, to extract only
- 00:37relevant genes, and methods
- 00:40for dimensionality reduction.
- 00:42Last time we saw
- 00:44principal component analysis and
- 00:46other tools that are used for single cell,
- 00:49especially for visualization, namely
- 00:52the t-SNE method and the UMAP method.
- 00:55They are both nonlinear and graph based.
- 00:59So today we will see briefly
- 01:01the remaining steps,
- 01:03downstream steps of the analysis.
- 01:05And we will start from the clustering
- 01:13of single-cell data.
- 01:14So we have our cells
- 01:17mapped in our low-dimensional space
- 01:20and we want to identify clusters,
- 01:24meaning cells that have a similar
- 01:26expression signature so that they
- 01:29are very similar to each other.
- 01:31Since this problem of clustering
- 01:33is quite general and it was also
- 01:36covered by June in the first
- 01:38lesson of this didactic section,
- 01:40many of the methods that
- 01:43can be used for single-cell data are
- 01:45the same that he covered last time.
- 01:48What you see
- 01:50here is an example of a data set
- 01:52of cells,
- 01:54and I think they are divided because
- 01:58of a different developmental stage.
- 02:02So these clusters that you
- 02:05see here are known
- 02:08because, in the experiment,
- 02:11these cells were isolated at
- 02:14different differentiation
- 02:15states, and they are already
- 02:18mapped in a low-dimensional space.
- 02:20What you see here are
- 02:23the first two principal components.
- 02:26Usually clustering is
- 02:28calculated on principal components
- 02:30or on a reduced dimensional space.
- 02:33This doesn't mean that you
- 02:34only take the first two.
- 02:36You have to select the
- 02:38first 10, 20 or 30.
- 02:39There are methods to select
- 02:44a certain number of dimensions
- 02:46so that you keep what is
- 02:49rich in information and you
- 02:51remove the lower dimensions that
- 02:53are associated with less variance.
- 02:56And the assumption is that
- 02:59they mostly capture noise.
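One simple way to pick how many principal components to keep can be sketched as follows. This is only an illustration, not the method used in the lecture: it assumes you already have the variance of each component, and both the 85% threshold and the toy variances are made-up values.

```python
def n_components_for_variance(variances, threshold=0.85):
    """Return the smallest number of leading components whose
    cumulative share of the total variance reaches `threshold`."""
    total = sum(variances)
    cumulative = 0.0
    for count, variance in enumerate(variances, start=1):
        cumulative += variance
        if cumulative / total >= threshold:
            return count
    return len(variances)


# Toy per-component variances, sorted from PC1 downwards (hypothetical).
pc_variances = [40.0, 25.0, 15.0, 10.0, 5.0, 3.0, 2.0]
n_keep = n_components_for_variance(pc_variances, threshold=0.85)
print(n_keep)
```

The components after the cutoff are the low-variance ones that are assumed to carry mostly noise.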
- 03:01But the example
- 03:03here is a simplified example,
- 03:05so you see only PC1 and PC2.
- 03:08So, in order to cluster, you have
- 03:11the two approaches that June
- 03:13covered in the clustering lesson. The first
- 03:16was hierarchical clustering.
- 03:18These methods try to
- 03:20progressively connect
- 03:21cells that are similar to each other.
- 03:26So also in this case if you remember
- 03:28there is the concept of distance.
- 03:30So all clustering methods are
- 03:32based on the fact that you have
- 03:34to calculate a sort of distance
- 03:36or similarity measure between
- 03:38pairs of cells.
- 03:39So for example,
- 03:40here you can measure the distance
- 03:43as the Euclidean distance in
- 03:45this principal component space.
- 03:46Otherwise you can use other
- 03:48measures of distance,
- 03:50for example correlation and so on.
- 03:53The choice of the distance
- 03:56measure will
- 03:59change the clustering results.
- 04:00This always holds true.
- 04:02Hierarchical clustering tries
- 04:04to progressively connect similar
- 04:06entities, from the bottom up, until arriving
- 04:11at unifying everything.
- 04:12And then the problem of hierarchical
- 04:15clustering is to decide where to cut the tree.
- 04:18Depending on where you cut the tree,
- 04:20you separate different numbers of clusters: if you cut
- 04:22this tree here, you separate
- 04:24two clusters; here
- 04:25you separate three, and so on.
- 04:27In this example,
- 04:28you know that there are five
- 04:31main clusters,
- 04:33but this information is not always obvious.
- 04:37So in clustering, the
- 04:39optimal number
- 04:41of clusters that represents your
- 04:43data is always a tricky choice
- 04:46and ultimately subjective.
- 04:49The second approach that Jonah
- 04:51explained in the lesson
- 04:53about clustering is k-means.
- 04:55K-means clustering is based on the fact that
- 04:59you select a priori, before beginning,
- 05:01a number of clusters, that is k, into
- 05:04which you want to divide your data,
- 05:07and then you apply a sort of
- 05:11iterative procedure that is based
- 05:14on the definition of a centroid.
- 05:16The centroid is the
- 05:18average point of a cluster,
- 05:20so it's not a real point,
- 05:22but it's a point that represents
- 05:24the average of all the points
- 05:26that belong to the cluster.
- 05:28The procedure is to iteratively
- 05:31assign each cell to the nearest
- 05:34centroid until you reach convergence,
- 05:38so until,
- 05:41in consecutive iterations,
- 05:42for example, you don't have any
- 05:45change of labels among the cells.
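The iterative procedure just described can be sketched in a few lines. This is a minimal illustration under assumptions, not a production implementation: the initial centroids are picked at random from the data, the centroids are recomputed as cluster means after each assignment step, and the six toy points stand in for cells in PC space.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means: assign each point to the nearest centroid, then
    recompute centroids, until the labels stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: random initial centroids
    labels = [None] * len(points)
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # convergence: no label changed
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


# Toy "cells" in a two-dimensional PC space (hypothetical coordinates).
cells = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(cells, k=2)
print(labels)
```

Different seeds can give different label numbering, and on less clearly separated data they can also give different partitions.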
- 05:47A family of clustering
- 05:50methods that is widely used
- 05:52in single-cell approaches is
- 05:55the graph-based family.
- 05:57This is something that John
- 05:59didn't talk about. The principle
- 06:02here is to build a graph
- 06:05on this space, and usually
- 06:08it is a so-called
- 06:12k-nearest neighbor graph.
- 06:13That means that for every cell
- 06:16you draw a line that connects the
- 06:19cell with its top k nearest cells,
- 06:21and that's the k parameter.
- 06:24So for example, here
- 06:26this is a 10-nearest neighbor graph,
- 06:31so it means that each cell
- 06:33here is connected to the top
- 06:3510 nearest cells.
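Building such a k-nearest neighbor graph can be sketched as below. This is a brute-force illustration that assumes Euclidean distance and a handful of toy points. Real pipelines typically use approximate neighbor search, which is not shown here.

```python
import math


def knn_graph(points, k):
    """Map each point index to the indices of its k nearest other points
    (Euclidean distance, brute force)."""
    graph = {}
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        graph[i] = others[:k]
    return graph


# Toy "cells" in a reduced space (hypothetical coordinates).
cells = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
graph = knn_graph(cells, k=2)
print(graph)
```

With k = 2 the two groups of toy cells end up as two disconnected pieces of the graph, which is the easy case for community detection.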
- 06:37After you do this, basically
- 06:40your map becomes a graph.
- 06:44A graph is a set of nodes,
- 06:47and each node here is a
- 06:50cell, with a set of connections.
- 06:52These connections
- 06:55can also be weighted,
- 06:57so a weight can be assigned to each
- 07:00connection depending on how similar
- 07:02the two cells are. The method,
- 07:05once you have built the graph, is
- 07:08to identify inside this graph, which
- 07:10can also be seen as a network,
- 07:13communities,
- 07:16so to identify inside this graph
- 07:20communities of nodes, so clusters of
- 07:22nodes that are highly interconnected
- 07:25among themselves and have few
- 07:29interconnections with other clusters.
- 07:31Obviously if,
- 07:32as in this case,
- 07:34you obtain not a single network
- 07:37but different networks that are
- 07:40completely separated, it's easier.
- 07:43It's obviously easier to separate
- 07:45these three clusters,
- 07:46because in the graph they don't
- 07:48share any connections, but sometimes the
- 07:51graph-based approaches
- 07:53also try to cut within
- 07:56these networks in order to... yes?
- 08:02Sorry, was there a question?
- 08:06That was veins veins.
- 08:07Did you have a question?
- 08:11Sorry, I accidentally had my microphone on.
- 08:15Sorry. And so these methods try to
- 08:20divide a network into communities,
- 08:23so they try to cut the network
- 08:26in order to increase the density
- 08:29of the chunks
- 08:33into which the network is divided.
- 08:36Here I have a slide that
- 08:39explains this better,
- 08:41but the principle is that you build
- 08:43a graph where each node is a cell
- 08:45connected to its nearest neighbors,
- 08:47and then you try to identify the
- 08:49communities by creating cuts
- 08:52inside the network. The cuts have
- 08:56to isolate parts of the network so
- 09:00that you don't remove a lot of links
- 09:03and you increase the density of the links
- 09:06of what remains. There are
- 09:09many approaches to do that. In
- 09:12the single-cell pipelines,
- 09:13especially in the most popular
- 09:16methods, you will always see the Louvain
- 09:19method for community detection.
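The idea of cutting few links while keeping dense chunks is usually scored with modularity, the quantity that the Louvain method tries to maximize. The sketch below only computes that score for a given partition of an unweighted graph. It is not an implementation of Louvain itself, and the toy graph (two triangles joined by one edge) is made up for illustration.

```python
def modularity(edges, labels):
    """Modularity of a partition of an undirected, unweighted graph:
    sum over communities of (internal edges / m) - (degree sum / 2m)^2."""
    m = len(edges)
    degree_sum = {}      # total degree of the nodes in each community
    internal_edges = {}  # edges with both ends in the same community
    for u, v in edges:
        for node in (u, v):
            community = labels[node]
            degree_sum[community] = degree_sum.get(community, 0) + 1
        if labels[u] == labels[v]:
            internal_edges[labels[u]] = internal_edges.get(labels[u], 0) + 1
    return sum(
        internal_edges.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Two triangles (0-1-2 and 3-4-5) connected by the single edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_communities = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_community = {node: "a" for node in range(6)}
q_two = modularity(edges, two_communities)
q_one = modularity(edges, one_community)
print(q_two, q_one)
```

Cutting the single bridging edge gives a higher score than leaving everything in one community, which matches the intuition of removing few links and keeping dense groups.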
- 09:24The advantage of this is that
- 09:26many other methods for single-cell
- 09:29analysis use the same approach,
- 09:31so they first build this
- 09:34graph and then they
- 09:37identify the clusters as
- 09:40communities in the interconnections.
- 09:42And then we will see that also,
- 09:44for example, many trajectory tools
- 09:47use this kind of graph in
- 09:50order to build the trajectory.
- 09:55So this is for the clustering.
- 09:58Another important step,
- 10:00once you have identified the clusters, is to
- 10:04identify the genes that
- 10:08characterize these clusters.
- 10:12That means that you want to identify the
- 10:16so-called marker genes for each cluster.
- 10:19These are ideally genes that
- 10:21are expressed only
- 10:23in a single cluster and not
- 10:25in the other clusters.
- 10:27And here you see an example.
- 10:29This is a
- 10:31UMAP representation,
- 10:33in the first two dimensions, of a single-cell
- 10:37analysis of peripheral blood cells.
- 10:41So you see there is clustering.
- 10:43Louvain clustering has been
- 10:45applied and has identified
- 10:47eight clusters of cells, which
- 10:50you see here numbered from 0 to 7. The
- 10:55identification of the marker genes is the
- 10:58identification of genes that will be useful
- 11:02in order to annotate the clusters.
- 11:04For example,
- 11:06as you see here,
- 11:07if you look at the
- 11:09expression level of this gene,
- 11:10you often see this kind of representation
- 11:14in papers or analyses of single
- 11:16cells: you can draw the same map,
- 11:19but instead of coloring the cells according
- 11:21to the cluster, you color each cell
- 11:24according to the expression of one gene.
- 11:26Here cells are gray if the
- 11:30gene MS4A1 is not expressed and
- 11:33they are violet if the gene
- 11:35is expressed a lot. You see that
- 11:37this is a good marker for cluster
- 11:40three because the marker
- 11:42is highly expressed only in this
- 11:44cluster and most of the other cells
- 11:47do not express this gene at all.
- 11:49So the task here is to
- 11:52identify genes that are like this,
- 11:55so genes that are highly expressed only
- 11:57in one cluster and not in the others.
- 11:59Basically this task is very similar
- 12:01to a task of differential expression,
- 12:03because what you want to do is to identify,
- 12:07for each cluster,
- 12:08genes that are differentially expressed,
- 12:10in particular
- 12:11more expressed in the cluster
- 12:13than in all the other cells.
- 12:15So you divide your cells into two sets,
- 12:20the cells belonging to the cluster
- 12:21and all the other cells,
- 12:23and then you try to identify genes
- 12:25that are
- 12:26differentially expressed.
- 12:29So the aim of this task is to identify
- 12:31genes with different expression,
- 12:33usually among clusters of cells,
- 12:35and these are important
- 12:37because they are cluster markers.
- 12:39Now this is something that
- 12:41we also covered in the second
- 12:44lesson, where Everett spoke about
- 12:46this task of differential expression
- 12:49analysis with bulk RNA-seq.
- 12:52The approaches
- 12:54that are used in single cell are different,
- 12:58and this is an area where there is no clear winner:
- 13:02everyone has a favorite method.
- 13:04There are two papers that try to
- 13:06do a benchmark and compare different
- 13:08methodologies for differential expression,
- 13:11but the result is that no method
- 13:13is better than the others.
- 13:15The problem is that when you consider
- 13:19the distribution
- 13:21of expression levels of
- 13:23single genes in single-cell data,
- 13:25they are very heterogeneous.
- 13:27So here you see three examples
- 13:30of the expression densities
- 13:32across cells of three genes.
- 13:35You see that these ambience
- 13:3817 has this density,
- 13:40which could be approximated
- 13:43with something like a normal distribution.
- 13:46This other one has two peaks,
- 13:51including one peak of cells
- 13:55where the gene is not expressed.
- 13:58This is quite common in single cell
- 14:01because of the dropout events.
- 14:03If you remember,
- 14:05that's one of the main problems of
- 14:06single cell: in a lot of cells
- 14:09the absence of a gene is not
- 14:11biological but technical,
- 14:12so the gene simply
- 14:13was not captured in the library preparation.
- 14:16That's why for many genes
- 14:18you have this situation:
- 14:20in the same cluster of cells you
- 14:23have some cells that express the gene
- 14:25and some cells that do not express the gene.
- 14:28And then in other cases,
- 14:30for example, you have most of the
- 14:32cells with zero expression.
- 14:34So this means that there is no
- 14:37unique distribution that allows
- 14:39you to model
- 14:41the expression of genes.
- 14:43That's why a
- 14:46popular approach for
- 14:48doing differential
- 14:50expression with single-cell data
- 14:52is to use nonparametric tests.
- 14:55Nonparametric tests do not make
- 14:57any assumption about the underlying
- 14:59distribution of the expression.
- 15:02For example,
- 15:03probably the most used is the Wilcoxon
- 15:06rank-sum test. A lot of these tests
- 15:09do not
- 15:10compare the real values,
- 15:12but they compare ranks, so they
- 15:15transform numbers into ranks
- 15:16once you order the numbers from
- 15:19the highest to the lowest.
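The rank transformation behind the Wilcoxon rank-sum test can be sketched as follows. This is a simplified illustration under assumptions: it ranks from lowest to highest (the direction is a convention), gives tied values their average rank, and stops at the rank sum and the U statistic without computing a p-value. The expression values are made up.

```python
def average_ranks(values):
    """Rank values from lowest (rank 1) to highest; ties share the
    average of the ranks they would occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1  # average of 1-based positions
        for i in order[pos:end + 1]:
            ranks[i] = shared_rank
        pos = end + 1
    return ranks


def rank_sum_and_u(in_cluster, other_cells):
    """Rank sum of the cluster's values and the corresponding U statistic."""
    ranks = average_ranks(list(in_cluster) + list(other_cells))
    n = len(in_cluster)
    rank_sum = sum(ranks[:n])
    u_statistic = rank_sum - n * (n + 1) / 2
    return rank_sum, u_statistic


# Hypothetical expression of one gene; note the tied zeros.
cluster_expr = [5, 7, 9, 0]
rest_expr = [0, 0, 1, 2]
rank_sum, u_stat = rank_sum_and_u(cluster_expr, rest_expr)
print(rank_sum, u_stat)
```

The three zeros all receive the same rank, which shows how a gene with many zero counts produces large groups of tied values.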
- 15:22So
- 15:23they can be used with single cell, also
- 15:25because,
- 15:26when you have a lot of cells, you
- 15:28have a lot of measurements, and
- 15:30these kinds of tests work well
- 15:33when you have a lot of replicates,
- 15:35because you can consider each cell
- 15:37as a replicate when you
- 15:40try to establish the difference.
- 15:43They are problematic, since they
- 15:46work on ranks,
- 15:47when you have a
- 15:49lot of values that are the same.
- 15:50These are tied values, and
- 15:52that's exactly what happens
- 15:54with the zeros,
- 15:55so this could be a problem of
- 15:57applying nonparametric tests when
- 15:58you have a lot of cells where
- 16:01the gene is not expressed at all.
- 16:03Then the other methods are the same
- 16:06as the ones used in bulk RNA-seq,
- 16:09so these were the ones covered by Everett,
- 16:14edgeR and DESeq2, and they are based on
- 16:18modeling the gene expression with
- 16:20a negative binomial distribution.
- 16:22And then you have a lot of methods that
- 16:24were developed for single cell,
- 16:26and instead of the negative
- 16:28binomial they use other distributions
- 16:31that deal with the excess
- 16:33of zeros that you have in
- 16:35single-cell data sets.
- 16:37So these are the three
- 16:38main families that you
- 16:39will find. There is no clear winner or
- 16:42approach that is more used than the others.
- 16:48So, finding the marker genes,
- 16:51as we said,
- 16:52is very important because
- 16:53it's necessary to
- 16:56understand
- 16:59the identity of each cell cluster,
- 17:02and it's important to label
- 17:05cell clusters with cell types.
- 17:07This is a problem that is
- 17:10called cell type annotation.
- 17:13The aim is that you want to annotate
- 17:15clusters with known cell types,
- 17:17depending on the system
- 17:19that you are studying.
- 17:21So,
- 17:22if you're speaking about,
- 17:23for example,
- 17:24peripheral blood, you want to
- 17:26associate these clusters with
- 17:29known populations of blood cells that
- 17:31you expect to find,
- 17:34so T cells, B cells and so on.
- 17:38The main approaches are,
- 17:41obviously, the manual approach: you look
- 17:44at the marker genes and you know which
- 17:46are the genes that should be highly
- 17:49expressed in each cell population,
- 17:51so you know which are the B cell markers and
- 17:53the T cell markers, and you use your
- 17:56personal knowledge to annotate the clusters.
- 17:59This has been, in
- 18:02past analyses of single cells,
- 18:05the most used method.
- 18:06Since it's manual and
- 18:09based on personal knowledge, it is also
- 18:11very time consuming, because you need
- 18:14to review all the clusters and
- 18:17annotate each cluster manually
- 18:20based on your subjective knowledge.
- 18:23There is a
- 18:25huge development of automatic
- 18:28tools to perform this step,
- 18:31so to perform cell type annotation.
- 18:34And these automatic
- 18:37procedures can be divided into two families.
- 18:42There are some procedures that are
- 18:45based on databases of marker genes.
- 18:47What these
- 18:50procedures do is that they
- 18:52compare the list of marker genes
- 18:55of each cluster with a database
- 18:59of marker genes that were found
- 19:03experimentally
- 19:04in known populations of cells,
- 19:07so the comparison is between
- 19:09different lists of marker genes.
- 19:11The advantage is that
- 19:13you don't necessarily need another
- 19:15single-cell data set as a reference.
- 19:18You just need a list of genes,
- 19:21and we will see that there
- 19:24are databases that try to
- 19:27cover all the marker genes
- 19:29for cell populations,
- 19:30at least in human and mouse.
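The list-against-list comparison can be sketched with a simple overlap score. This is an illustration only: Jaccard overlap is one possible score among many, and the small marker table is a hand-written toy, not an extract from any real marker database.

```python
def jaccard(genes_a, genes_b):
    """Size of the intersection divided by size of the union."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def annotate_by_markers(cluster_markers, marker_db):
    """Score a cluster's marker list against every cell type in the
    database and return the best-scoring type with all scores."""
    scores = {
        cell_type: jaccard(cluster_markers, genes)
        for cell_type, genes in marker_db.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy marker table (illustrative, hand-written).
toy_db = {
    "B cell": ["MS4A1", "CD79A", "CD79B"],
    "T cell": ["CD3D", "CD3E", "IL7R"],
    "Monocyte": ["CD14", "LYZ"],
}
# Hypothetical marker genes found for one cluster.
label, scores = annotate_by_markers(["MS4A1", "CD79A", "CD74"], toy_db)
print(label, scores)
```

Only gene lists are needed here, which is the advantage mentioned above: no reference expression data set is required.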
- 19:33Another family of approaches requires not
- 19:36only a known list of marker genes,
- 19:39but requires an annotated
- 19:45single-cell RNA-seq experiment.
- 19:48So
- 19:49the strategy is what is represented here.
- 19:52You have a query data set, that is
- 19:54your own data set, where you
- 19:57have clusters but you don't have
- 19:59labels, and then you have a reference
- 20:01data set, so someone else already
- 20:03performed their analysis of single
- 20:06cells and labeled the clusters of cells.
- 20:09The strategy is to try to
- 20:12identify which of the clusters
- 20:15of the query data set are more similar
- 20:18to the reference, and this is
- 20:21basically a problem of classification,
- 20:23so they try to classify a data set
- 20:25with unknown labels using a
- 20:28single-cell data set with known labels.
- 20:30Also in this family,
- 20:32obviously, you have many possible
- 20:36methods to do this.
- 20:38Some of the methods are based
- 20:41on correlation: they try to calculate
- 20:45the similarity through correlation measures.
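A correlation-based transfer of labels can be sketched as below. This is a minimal illustration under assumptions: each cluster is summarized by an average expression profile over the same ordered genes, Pearson correlation is the similarity measure, and all numbers are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def annotate_by_correlation(query_profile, reference_profiles):
    """Return the reference label whose profile correlates best with the
    query cluster profile, together with all correlations."""
    scores = {
        label: pearson(query_profile, profile)
        for label, profile in reference_profiles.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical mean expression over the same four genes.
reference = {
    "T cell": [9.0, 8.0, 0.0, 1.0],
    "B cell": [0.0, 1.0, 9.0, 8.0],
}
query_cluster = [1.0, 0.0, 7.0, 9.0]
best_label, correlations = annotate_by_correlation(query_cluster, reference)
print(best_label, correlations)
```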
- 20:47Other approaches use supervised
- 20:51classification methods that are
- 20:53commonly used in machine learning.
- 20:55So this is one of the fields where,
- 20:58speaking about 2021, there are
- 21:03huge developments and a lot of
- 21:05tools that are being published either
- 21:09in bioRxiv or in journals right now.
- 21:13And there is this metaphor that, once we
- 21:15have a lot of
- 21:20data sets that are annotated,
- 21:23this task will become
- 21:25like
- 21:27mapping reads to a known genome.
- 21:31So performing a single-cell analysis
- 21:34with a new data set will become
- 21:36as simple as that, because you have
- 21:38a lot of reference populations, and
- 21:40so it will be easier to annotate
- 21:43your cells once you have a collection
- 21:46of references that is reliable.
- 21:58These are two resources, two databases that
- 22:01collect cell type annotation markers,
- 22:03and so these can be used to compare the
- 22:08markers identified in your clusters with
- 22:10a known collection of markers that were
- 22:14identified based on single-cell data.
- 22:17If you
- 22:20look at flow cytometry,
- 22:22the best markers are considered to be the
- 22:26proteins that are expressed on the surface.
- 22:29The problem is that the transcripts of
- 22:32these surface markers are
- 22:35not always among the top expressed genes,
- 22:39and so they may be subject to dropout
- 22:42events, and so the best collection of
- 22:45cell markers for a transcriptomic study,
- 22:48so based on gene expression, is different
- 22:50from the best collection of markers based on
- 22:54surface proteins.
- 22:59And these two databases
- 23:01collected single-cell signatures,
- 23:03single-cell markers, in different
- 23:06samples and tissues, mainly
- 23:08for human and mouse.
- 23:11Also, comparing the
- 23:14two species is important to know which
- 23:17markers are conserved across species
- 23:19and which ones are species specific.
- 23:26Then the last step for today
- 23:31is trajectory analysis.
- 23:34While clustering tries to divide
- 23:38the cells into discrete clusters,
- 23:41the idea of trajectory analysis
- 23:43is that
- 23:47you are not capturing
- 23:49cells in discrete states,
- 23:51but you're capturing a sort of continuous
- 23:54process, for example
- 23:57differentiation;
- 24:00yes, differentiation.
- 24:01So these kinds of methods try to place
- 24:05cells along a continuous path that
- 24:09represents the evolution of a process.
- 24:11This could be differentiation,
- 24:13but another
- 24:16simple example is the cell cycle.
- 24:19So instead of dividing
- 24:21cells into separate clusters,
- 24:22you try to construct a sort of trajectory
- 24:26that models this progression through,
- 24:30for example, differentiation.
- 24:32And these sorts of
- 24:35tools, trajectory inference
- 24:37analyses, are also sometimes named
- 24:40pseudotime analysis,
- 24:43because pseudotime is basically
- 24:47an abstract measure of
- 24:50the progression through the process,
- 24:52so from when the
- 24:55process starts to where it ends.
- 24:58The important assumption
- 25:00in order to perform trajectory
- 25:03analysis is that we are capturing
- 25:05with our single-cell experiment
- 25:08all the snapshots of the process
- 25:10that we want to model, and this
- 25:13means that we are also capturing
- 25:16the intermediates, because all
- 25:19the analysis depends on the
- 25:21assumption that we have a continuum,
- 25:23not discrete states,
- 25:23and so we need to
- 25:26capture some cells that represent
- 25:29the transition between,
- 25:30for example, two differentiation
- 25:35stages. So the assumption is
- 25:38that we're capturing all the
- 25:40snapshots and we don't have holes,
- 25:42and so we have a lot of intermediates.
- 25:44The warning is that
- 25:46these tools
- 25:48will find a trajectory for
- 25:51each data set that you use as input,
- 25:54but this doesn't mean that the trajectory
- 25:56that you find has any biological meaning.
- 26:00So for the common approach to do this,
- 26:02there are a lot of methods also
- 26:05for this; a common and simple to explain
- 26:08approach is represented here.
- 26:12This is a PCA plot, and here,
- 26:15instead of only two dimensions,
- 26:17you see three dimensions,
- 26:19so PC1, PC2 and PC3.
- 26:22Each dot is a cell.
- 26:25What you do is you first perform
- 26:27a clustering of cells, with
- 26:30k-means or with a graph-based approach.
- 26:32This depends on the tool.
- 26:35So you identify these clusters of cells
- 26:38inside your population. Now, if you remember,
- 26:42each cluster can be associated with
- 26:44a centroid, where the centroid is
- 26:47the central point of the cluster,
- 26:49so it is in the
- 26:51mean position with respect to
- 26:54all the elements of the cluster.
- 26:56These dots here represent the
- 26:59centroids of the clusters that you identified.
- 27:02Now what you do is you try to build a tree
- 27:05that connects these centroids, and
- 27:07to build this tree
- 27:09there are many strategies.
- 27:11One of the simplest is to
- 27:14build the minimum spanning tree.
- 27:17So you're trying to connect all these
- 27:19points in a way that minimizes
- 27:22the total length of
- 27:25the branches.
- 27:25If you have a set of points,
- 27:28you can find a solution with a tree
- 27:32that minimizes the sum of the lengths
- 27:35of all the branches of your tree,
- 27:37and this is called the minimum spanning tree.
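A minimum spanning tree over cluster centroids can be sketched with Prim's algorithm. This is a brute-force illustration under assumptions: Euclidean distances, a handful of hypothetical centroids, and a start from the first centroid (the start does not change the total length).

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm: grow the tree from point 0, always adding the
    shortest edge that reaches a point not yet in the tree."""
    n = len(points)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        length, i, j = min(
            (math.dist(points[i], points[j]), i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges


# Hypothetical cluster centroids in a two-dimensional PC space.
centroids = [(0, 0), (1, 0), (2, 0), (2, 1)]
tree_edges = minimum_spanning_tree(centroids)
total_length = sum(math.dist(centroids[i], centroids[j]) for i, j in tree_edges)
print(tree_edges, total_length)
```

Working on a few centroids instead of thousands of single cells keeps the tree small and less sensitive to the position of individual cells.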
- 27:40The assumption is that the minimum
- 27:42spanning tree is the correct tree that
- 27:46models the trajectory in this data,
- 27:49and this is not always the case,
- 27:51so a warning is that
- 27:53the minimum spanning tree is not
- 27:55always the best solution.
- 27:58Once you do this,
- 27:59you have your tree, and some tools
- 28:03try to assign a root to
- 28:06this tree, or they may
- 28:09also give you the possibility to select the
- 28:12root, so the user can say that this
- 28:15is the root of the tree,
- 28:16because you know that these cells,
- 28:19for example, are most similar to
- 28:21what you expect from stem cells.
- 28:23And so
- 28:25once you define the root of the tree,
- 28:28then you can
- 28:29define and smooth your tree and you
- 28:31can calculate the pseudotime for each cell,
- 28:34where the pseudotime of the cell will be
- 28:37the distance from the position of
- 28:39the cell to the root of your tree
- 28:42following the topology of the tree.
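Measuring distance from the root along the tree can be sketched as follows. This is a deliberate simplification and not what any particular tool does: each cell simply inherits the along-tree distance of its nearest centroid, whereas real tools project cells onto a smoothed tree. The centroids, edges and root are hypothetical.

```python
import math
from collections import deque


def tree_distances(points, edges, root):
    """Distance of every tree node from the root, following tree edges."""
    neighbors = {i: [] for i in range(len(points))}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    distance = {root: 0.0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in distance:
                distance[nxt] = distance[node] + math.dist(points[node], points[nxt])
                queue.append(nxt)
    return distance


def pseudotime(cell, centroids, centroid_time):
    """Coarse pseudotime: the along-tree distance of the nearest centroid."""
    nearest = min(range(len(centroids)), key=lambda c: math.dist(cell, centroids[c]))
    return centroid_time[nearest]


# Hypothetical centroids connected in a chain; centroid 0 chosen as root.
centroids = [(0, 0), (1, 0), (2, 0), (2, 1)]
edges = [(0, 1), (1, 2), (2, 3)]
times = tree_distances(centroids, edges, root=0)
cell_time = pseudotime((1.9, 0.9), centroids, times)
print(times, cell_time)
```

Choosing a different root reverses or reshuffles the ordering, which is why the choice of root matters.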
- 28:50Is this more or less clear?
- 28:54Yes to me. OK,
- 28:57so many methods use this approach:
- 29:00first dimensional reduction,
- 29:02then clustering, then the construction
- 29:05of a tree. At the beginning
- 29:09the tree was built on the single cells,
- 29:12but that's unstable, and so
- 29:14there was a switch from the single
- 29:16cells to the centroids,
- 29:17because the centroid is more stable,
- 29:20and so if you have
- 29:22deviations of the position,
- 29:23the overall tree will remain the same.
- 29:26So
- 29:27that's why it's more stable, and
- 29:29then, depending on the choice of the
- 29:32dimensionality reduction approach
- 29:33and on the tree that you build,
- 29:36you have different outcomes and
- 29:38different trajectories. This is
- 29:41from a paper that
- 29:44two years ago tried to compare
- 29:50eight different methods to perform both
- 29:53dimensional reduction and
- 29:58trajectory inference. So you see, for example,
- 30:01each row here is a different data set,
- 30:04so you have five data sets, from
- 30:07the simplest, because
- 30:09it has the smallest size
- 30:11(this is the number of cells),
- 30:14to an example where you
- 30:17have almost a quarter of a million cells.
- 30:20And with the same data set,
- 30:22you try to represent the data
- 30:24set with different approaches.
- 30:26So this is principal component analysis,
- 30:28this is t-SNE,
- 30:29this is the UMAP approach, and these
- 30:32others are methods, like Monocle, that also try
- 30:34to perform trajectory inference.
- 30:37And you see how differently the same data
- 30:40set is represented
- 30:43depending on the method that you use.
- 30:47What you see here is also
- 30:49that when the data set is bigger,
- 30:51some of the methods do not even finish,
- 30:54because the time required is too much.
- 30:56So you're not sure that some
- 30:59iterative methods that require a lot
- 31:02of steps converge, because you run out of,
- 31:05for example,
- 31:06the memory or the available time before
- 31:08they complete the necessary number of steps.
- 31:11And that's what you see here.
- 31:12But the point here is that, as usual,
- 31:15depending on the choice
- 31:17of the tool,
- 31:18you have a different representation
- 31:20of the same data,
- 31:24and there is no clear, automatic
- 31:27way to understand which is better.
- 31:32These are also some guidelines
- 31:34for selecting your best
- 31:37trajectory analysis tools, based
- 31:39also on the fact that some tools
- 31:43make assumptions on the trajectory.
- 31:46Some tools, for example, only try to
- 31:49model linear trajectories without branches.
- 31:53Some tools allow branches,
- 31:55but only bifurcations,
- 31:57so you can only have two choices
- 31:59when you have a decision.
- 32:01Some tools also allow
- 32:03multifurcations, so from one,
- 32:06let's say, crossroads you
- 32:09have multiple possible roads,
- 32:11and some of the tools also allow you to have
- 32:14cycles inside your trajectories.
- 32:17So depending on your assumptions,
- 32:20there are different families
- 32:23of methods that you can use
- 32:26and that are available online.
- 32:28This is all based, again, on a review
- 32:31of trajectory analysis choices.
- 32:33It's quite recent,
- 32:34from two years ago,
- 32:35so it contains
- 32:38for sure the most popular methods
- 32:41that are also used now.
- 32:44One exception that is not here,
- 32:48and
- 32:49it's a method that had a lot of
- 32:52popularity and
- 32:54offers some unique insight,
- 32:57is called RNA velocity, and this is
- 33:02the last thing I will speak about today.
- 33:05The paper was published three years ago,
- 33:09and it's a method that
- 33:13analyzes single cells
- 33:15using a biological insight
- 33:19that concerns splicing.
- 33:23So,
- 33:23you know that in human and in
- 33:26eukaryotic cells,
- 33:29when the RNA is transcribed,
- 33:30it has to be processed.
- 33:32One of the main steps of the processing
- 33:34is splicing, which removes
- 33:37the introns from your gene,
- 33:39and once the gene is spliced
- 33:40it is exported and so on.
- 33:42The basic principle of RNA
- 33:44velocity is that when you perform
- 33:46single-cell RNA-seq,
- 33:48you have some reads that are
- 33:50captured from unspliced RNA and
- 33:53some reads that are captured
- 33:55from spliced
- 33:56RNA, and you can distinguish between
- 34:00unspliced reads and spliced reads
- 34:02when you align the reads to
- 34:05your reference genome, because
- 34:07spliced reads
- 34:10will not contain introns,
- 34:12basically, while unspliced reads
- 34:14will contain, partially
- 34:16or totally, intron sequences.
- 34:18So you can divide the reads into
- 34:20spliced and unspliced.
- 34:23That's the basic assumption.
- 34:25And the second assumption is that you
- 34:29can calculate the ratio for each gene.
- 34:31The ratio between unspliced
- 34:34and spliced reads.
- 34:35And the assumption is that if you
- 34:39have a lot of unspliced reads,
- 34:41that means that the gene has a
- 34:44high transcription at the moment,
- 34:47and so it means that in the future
- 34:50probably that gene will be more abundant
- 34:53in the spliced state,
- 34:55because if you are capturing a cell at
- 34:57time zero and you're capturing
- 35:00a lot of unspliced RNA, in the future
- 35:03that RNA will be spliced and so
- 35:05the spliced RNA will increase.
- 35:08On the contrary,
- 35:09if you have a lot of spliced
- 35:12RNA and no unspliced
- 35:15RNA, you can predict that transcription
- 35:18at the moment is now shut off and
- 35:21the spliced RNA in the future
- 35:25will be reduced because of degradation.
- 35:28So, based on the ratio between unspliced
- 35:30and spliced reads: if you have a high
- 35:32proportion of unspliced reads,
- 35:35you predict that in the future
- 35:37the gene will be more expressed.
- 35:39If you have no unspliced reads,
- 35:41you predict that the gene will
- 35:44be less expressed in the future.
- 35:47And so if you're measuring, uh,
- 35:49the present state of the cell using
- 35:53this concept, you can predict where
- 35:55the cell will be in the future,
- 35:58and so you can infer the position
- 36:00of the cell in the future,
- 36:03and you can connect the observed
- 36:05cell with the predicted cell,
- 36:07and that's the principle of RNA velocity.
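The prediction step described above can be sketched with a simple steady-state model, in which the change of spliced RNA for a gene is approximated as `u - gamma * s` (unspliced counts minus a degradation term). This is a minimal sketch under stated assumptions: the counts are made up, the function names are hypothetical, and `gamma` is fitted with a plain least-squares slope over all cells, which is a simplification of what published tools do.

```python
import numpy as np


def fit_gamma(u: np.ndarray, s: np.ndarray) -> float:
    """Slope through the origin of unspliced vs spliced counts across cells.

    At steady state u is roughly gamma * s, so the slope estimates the
    degradation/splicing balance for this gene (simplified assumption).
    """
    return float(np.dot(u, s) / np.dot(s, s))


def velocity(u: np.ndarray, s: np.ndarray, gamma: float) -> np.ndarray:
    """Positive: spliced RNA is expected to increase; negative: to decrease."""
    return u - gamma * s


def extrapolate(s: np.ndarray, v: np.ndarray, dt: float = 1.0) -> np.ndarray:
    """Predicted spliced abundance after a small time step, kept non-negative."""
    return np.clip(s + v * dt, 0.0, None)


# Made-up counts for one gene in five cells.
u = np.array([1.0, 2.0, 3.0, 6.0, 0.0])  # unspliced
s = np.array([2.0, 4.0, 6.0, 6.0, 4.0])  # spliced

gamma = fit_gamma(u, s)
v = velocity(u, s, gamma)
s_future = extrapolate(s, v)

for i in range(len(s)):
    print(f"cell {i}: u={u[i]:.0f} s={s[i]:.0f} velocity={v[i]:+.2f}")
```

In this toy data the fourth cell has an excess of unspliced RNA and gets a positive velocity, while the last cell has no unspliced RNA and gets a negative one, matching the two cases described in the lecture.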
- 36:09So the plots that you see with RNA
- 36:12velocity are these, where each cell
- 36:15is, uh,
- 36:17linked to an arrow, and the
- 36:20arrow basically connects the cell
- 36:22with the prediction of where the cell
- 36:25will be in the future based on the
- 36:28ratio of unspliced versus spliced
- 36:31reads that you capture in this cell.
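To make the arrow idea concrete, the following sketch projects each cell and its predicted future state onto the first two principal components and takes the difference as the arrow. Everything here is an illustrative assumption: the counts and velocities are randomly generated, `pca_axes` is a hypothetical helper built on NumPy's SVD, and published tools use more elaborate ways of placing arrows on an embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes = 50, 20

# Made-up data: current spliced counts and per-gene velocities for each cell.
s = rng.poisson(5.0, size=(n_cells, n_genes)).astype(float)
v = rng.normal(0.0, 1.0, size=(n_cells, n_genes))
s_future = np.clip(s + v, 0.0, None)  # extrapolated state, kept non-negative


def pca_axes(x: np.ndarray, n_components: int = 2):
    """Return the column means and the top principal axes of x."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_components].T


# Fit the axes on the observed cells only, then project both states.
log_now = np.log1p(s)
log_future = np.log1p(s_future)
mean, axes = pca_axes(log_now)

start = (log_now - mean) @ axes     # observed position of each cell
end = (log_future - mean) @ axes    # predicted position of each cell
arrows = end - start                # one 2D arrow per cell

print(arrows[:3])
```

The `start` and `arrows` arrays could then be drawn as a vector field, for example with matplotlib's `quiver`.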
- 36:34So this can
- 36:37also be seen as a trajectory
- 36:39prediction tool,
- 36:40because once you see the map of all
- 36:43the arrows you can visually see where
- 36:45the cells are going, and
- 36:48you can sort of infer a trajectory.
- 36:50And the advantage of this method is that,
- 36:52for example, you don't have to
- 36:54select the root: you know from the
- 36:56arrows where the root is.
- 36:58So these arrows have a directionality,
- 37:01so, differently from these approaches,
- 37:04Yeah, you can link.
- 37:06You can not only have a trajectory but
- 37:08also have a trajectory with a direction.
- 37:14The problem that arises, obviously, is
- 37:17that all of this is based on the idea
- 37:20that you capture a lot of unspliced RNA,
- 37:24and this could be problematic,
- 37:25especially with certain libraries.
- 37:27For example, we have seen that 3'-end
- 37:30enriched or 5'-end enriched
- 37:33libraries do not cover the whole
- 37:35gene, and so you can have
- 37:37some biases. In the original paper
- 37:39they show that this method
- 37:41also works with the 10x data,
- 37:44which are 3' based,
- 37:46but at least for the data set that they show.
- 37:50And this is part of their
- 37:52original paper.
- 37:53And, as always, the cells
- 37:56are visualized in a
- 38:00dimensionality reduction space.
- 38:02For example, here you see again principal
- 38:05component 1 and principal component 2.
- 38:08Can I ask two questions? Yeah.
- 38:11Uhm, how would, for example, a splicing
- 38:15gene mutation affect this?
- 38:18Hey you guys, your lab does do that.
- 38:21Yeah, this is interesting.
- 38:23I don't recall that anybody, uh...
- 38:27This is interesting, obviously,
- 38:29because you expect a mutation in a splicing
- 38:34factor to alter, like, the trajectory.
- 38:38I don't know if anybody did this.
- 38:41Uh, I have to say I didn't
- 38:44run, like, an exhaustive search,
- 38:47so it may be that in bioRxiv there is
- 38:49something about this, but not that
- 38:51I'm aware of.
- 38:54OK, so the other question is how many,
- 38:57roughly in proportion,
- 38:59how many genes can be seen or can
- 39:02be used to project these maps?
- 39:05'Cause I'm imagining only the
- 39:07highly expressed genes that normally
- 39:09contain introns can be used here,
- 39:11which shouldn't be that many right?
- 39:14Yeah, yes, so that's true.
- 39:15So here again in the paper they compare,
- 39:18uh, so this is SMART-seq2,
- 39:20which is the technology that is
- 39:22ideal because you have a lot of reads
- 39:24for each cell and you have full
- 39:26coverage, and they capture 22% of
- 39:29reads that are unspliced.
- 39:33This is with the Chromium, so this is 10x.
- 39:37So yeah.
- 39:38So it seems that here the ratio is the same,
- 39:40but obviously they will be mostly
- 39:43the last intron, and so yeah,
- 39:46so the fact that you capture more,
- 39:49uh, a selection of genes that
- 39:52are highly expressed,
- 39:53that's the underlying
- 39:56bias of all the analyses
- 39:57at the single-cell level,
- 39:59and I assume that this is amplified
- 40:02in this sort of analysis.
- 40:05So, uh,
- 40:06but I I cannot give you a number
- 40:09of genes because I never used it.
- 40:13And so I don't have
- 40:16hands-on experience with this.
- 40:18K.
- 40:21But yes, for sure the limitation is on the
- 40:23number of genes and on the number of, uh...
- 40:27Yeah, and also on the length
- 40:30of the read and how much coverage you
- 40:34have from the poly-A tail, for example.
- 40:37I think that shorter genes,
- 40:39for example, genes with few exons
- 40:41and few introns that are shorter,
- 40:43will also be, like, more captured,
- 40:47more present in this analysis than
- 40:50long genes with long introns. Yeah.
- 40:55Uhm, OK, my last slide is about,
- 40:58uh, this collection of resources
- 41:00about single-cell seq analysis,
- 41:03so the website is called scRNA-tools.
- 41:05And, uh, so this is,
- 41:09uh, like, the trend of the number of
- 41:12tools that you can find in this,
- 41:14uh, collection.
- 41:15So right now there are over 1000
- 41:18computational tools for the analysis
- 41:21of single-cell RNA-seq.
- 41:24And here you see the stats on the platform.
- 41:29So on the languages that are
- 41:31mainly used by these tools,
- 41:33so most of them right now are
- 41:35in R, but obviously
- 41:37almost every tool is either
- 41:39in R or Python.
- 41:41Uhm, then you have C++.
- 41:43Probably this covers some
- 41:45tools with complicated tasks
- 41:48that need to be performed
- 41:50with efficiency from
- 41:52the computational point of view,
- 41:55and then you have some tools with
- 41:58MATLAB and others. And here what
- 42:01you see is that they divide these resources
- 42:05into categories.
- 42:07Some tools cover the full pipeline,
- 42:11at least from the digital gene
- 42:14expression, once you have the
- 42:16gene expression, to all the steps
- 42:18of the analysis, so dimension reduction,
- 42:20clustering and so on,
- 42:22and some tools are more specific.
- 42:24So if you look at the frequency you
- 42:26have most of the tools that are about
- 42:29visualization of single cell data.
- 42:3140% are about visualization.
- 42:33Then second position you have clustering
- 42:36and dimensionality reduction.
- 42:38I didn't speak about this,
- 42:40but also very important is the
- 42:43integration of different data sets.
- 42:45So this means the integration
- 42:47of different single-cell RNA-seq
- 42:49experiments,
- 42:49and also integration
- 42:51of multiple modalities.
- 42:53So, for example, yeah, there
- 42:55are a lot of techniques now that
- 42:58enable you to capture, for example,
- 43:01the RNA levels and also the
- 43:03chromatin, uh,
- 43:06open versus closed state, uh,
- 43:09in the same cell,
- 43:11and so these tools are about how to
- 43:15integrate these multiple sources of
- 43:17information and multiple datasets.
- 43:20And then you have over there engine actors,
- 43:22differential expression,
- 43:22so a lot of topics that
- 43:26we covered in these last
- 43:30sessions. So if you go and look at it, uh,
- 43:34you find the tool,
- 43:35the platform,
- 43:39then the number of citations.
- 43:40For example, you can see
- 43:42which tools are more popular
- 43:43with respect to others,
- 43:45if you have to make a choice. And
- 43:48the advantage is that it is quite
- 43:51comprehensive and updated weekly.
- 44:01So.
- 44:05These eyes.