Analysis and Interpretation of Single-Cell Sequencing Data – Part 3: Key Analysis Steps

August 25, 2021
  • 00:05So today we will finish our
  • 00:08session on single-cell data analysis.
  • 00:12So far we have arrived at this step in
  • 00:16the analysis: after all the quality
  • 00:20control and the normalization,
  • 00:22a necessary step to reduce the complexity
  • 00:25of the data set is the reduction
  • 00:29of dimensionality,
  • 00:31that is, a reduction of features.
  • 00:33There are two possible ways:
  • 00:35feature selection, which extracts only
  • 00:37the relevant genes, and methods
  • 00:40for dimensionality reduction.
  • 00:42So last time we saw
  • 00:44principal component analysis and
  • 00:46other tools that are used for single cell,
  • 00:49especially for visualization, namely
  • 00:52the t-SNE method and the UMAP method.
  • 00:55They are both nonlinear and graph based.
  • 00:59So today we will see briefly
  • 01:01the remaining steps,
  • 01:03downstream steps of the analysis.
  • 01:05And we will start from the clustering
  • 01:13of single-cell data.
  • 01:14So our cells are
  • 01:17mapped in our low-dimensional space
  • 01:20and we want to identify clusters,
  • 01:24meaning cells that have a similar
  • 01:26expression signature, so that they
  • 01:29are very similar to each other.
  • 01:31Since this problem of clustering
  • 01:33is quite general and it was also
  • 01:36covered by June in the first
  • 01:38lesson of this didactic section,
  • 01:40many of the methods that
  • 01:43can be used for single-cell data are
  • 01:45the same that he covered last time.
  • 01:48So what you see
  • 01:50here is an example of a data set
  • 01:52of cells,
  • 01:54and I think they are divided because
  • 01:58of a different developmental stage.
  • 02:02The clusters that you
  • 02:05see here are known
  • 02:08from the experiment, because
  • 02:11these cells were isolated at
  • 02:14different differentiation
  • 02:15states, and they are already
  • 02:18mapped in a low-dimensional space.
  • 02:20So what you see here are
  • 02:23the first two principal components.
  • 02:26Usually clustering is
  • 02:28calculated on principal components
  • 02:30or on a reduced-dimensional space.
  • 02:33This doesn't mean that you
  • 02:34only take the first two.
  • 02:36You have to select the
  • 02:38first 10, 20, 30.
  • 02:39There are methods to select
  • 02:44a certain number of dimensions
  • 02:46so that you keep what could be
  • 02:49worthy of information and you
  • 02:51remove the lower dimensions that
  • 02:53are associated with less variance.
  • 02:56And the assumption is that
  • 02:59they mostly capture noise.
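One simple way to choose how many principal components to keep can be sketched as follows. This is only an illustration, assuming a cumulative-variance cutoff; the talk does not say which selection method is used, and the function name, the threshold and the variance values are all hypothetical.

```python
def n_components_for_variance(variances, threshold=0.9):
    """Return the smallest number of leading PCs whose cumulative
    share of the total variance reaches `threshold`."""
    total = sum(variances)
    cumulative = 0.0
    for k, var in enumerate(variances, start=1):
        cumulative += var
        if cumulative / total >= threshold:
            return k
    return len(variances)


# Hypothetical per-component variances, sorted from PC1 downwards.
pc_variances = [40.0, 25.0, 15.0, 8.0, 5.0, 3.0, 2.0, 1.0, 0.5, 0.5]
n_keep = n_components_for_variance(pc_variances, threshold=0.9)
print(n_keep)  # the first 5 PCs explain 93% of the variance
```

The components after the cutoff are the low-variance ones that are assumed to capture mostly noise.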
  • 03:01But the example
  • 03:03here is a simplified example,
  • 03:05so you see only PC1 and PC2.
  • 03:08So in order to cluster, you have
  • 03:11the two approaches that June
  • 03:13covered in the clustering lesson. The first
  • 03:16was hierarchical clustering.
  • 03:18These methods try to
  • 03:20progressively connect
  • 03:21cells that are similar to each other.
  • 03:26Also in this case, if you remember,
  • 03:28there is the concept of distance.
  • 03:30All clustering methods are
  • 03:32based on the fact that you have
  • 03:34to calculate a sort of distance
  • 03:36or similarity measure between
  • 03:38pairs of cells.
  • 03:39So for example,
  • 03:40here you can measure the distance
  • 03:43as the Euclidean distance in
  • 03:45this principal component space.
  • 03:46Otherwise you can use other
  • 03:48measures of distance,
  • 03:50for example correlation, and so on.
  • 03:53The choice of the distance
  • 03:56measure will
  • 03:59change the clustering results.
  • 04:00So this always holds true.
  • 04:02Hierarchical clustering tries
  • 04:04to progressively connect similar
  • 04:06entities from the bottom up, until arriving
  • 04:11at unifying everything.
  • 04:12And then the problem of hierarchical
  • 04:15clustering is to decide where to cut the tree.
  • 04:18Depending on where you cut the tree,
  • 04:20you can separate: if you cut
  • 04:22this tree here, you separate
  • 04:24two clusters; here,
  • 04:25you separate three, and so on.
  • 04:27In this example,
  • 04:28you know that there are one, two, three, four, five
  • 04:31main clusters,
  • 04:33but this information is not always obvious.
  • 04:37In clustering, the
  • 04:39optimal number
  • 04:41of clusters that represent your
  • 04:43data is always a tricky choice
  • 04:46and ultimately subjective.
  • 04:49The second approach that Jonah
  • 04:51explained in the lesson
  • 04:53about clustering is k-means.
  • 04:55K-means clustering is based on the fact that
  • 04:59you select a priori, before beginning,
  • 05:01a number of clusters, that is k, into
  • 05:04which you want to divide your data,
  • 05:07and then you apply a sort of
  • 05:11iterative procedure that is based
  • 05:14on the definition of a centroid.
  • 05:16The centroid is the
  • 05:18average point of a cluster,
  • 05:20so it's not a real point,
  • 05:22but it's a point that represents
  • 05:24the average of all the points
  • 05:26that belong to the cluster.
  • 05:28The procedure is to iteratively
  • 05:31assign each cell to the nearest
  • 05:34centroid until you reach convergence,
  • 05:38so until,
  • 05:41in consecutive iterations,
  • 05:42for example, you don't have any
  • 05:45change in the labels of the cells.
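The iterative procedure just described can be sketched in a few lines of plain Python. This is a minimal illustration on made-up 2D coordinates (standing in for PC1 and PC2), not the implementation of any specific single-cell package; the function name and the choice of starting centroids are assumptions.

```python
import math


def kmeans(points, initial_centroids, max_iter=100):
    """Plain k-means: k is the number of starting centroids supplied."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each cell goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no label changed between iterations
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centroids


# Hypothetical cells in a two-dimensional PCA space.
cells = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
         (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(cells, initial_centroids=[cells[0], cells[3]])
print(labels)
print(centroids)
```

Note that each centroid ends up at the average position of its cluster, which is generally not one of the cells themselves.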
  • 05:47A family of clustering
  • 05:50methods that is widely used
  • 05:52in single-cell approaches is
  • 05:55the graph-based family.
  • 05:57This is something that John
  • 05:59didn't talk about. The principle
  • 06:02here is to build a graph
  • 06:05on this space, and usually
  • 06:08it is a so-called k-nearest
  • 06:12neighbor graph.
  • 06:13That means that for every cell
  • 06:16you draw lines that connect the
  • 06:19cell with its k nearest cells,
  • 06:21and that's the k parameter.
  • 06:24So for example, in this example here,
  • 06:26this is a 10-nearest-neighbor graph,
  • 06:31so it means that each cell
  • 06:33here is connected to its
  • 06:3510 nearest cells.
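A k-nearest-neighbor graph of this kind can be sketched as follows. This is a brute-force illustration on made-up coordinates, assuming Euclidean distance; real pipelines use faster neighbor searches, and the function name is hypothetical.

```python
import math


def knn_graph(points, k):
    """Map each cell index to the indices of its k nearest cells."""
    graph = {}
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        # Sort the other cells by Euclidean distance to cell i.
        others.sort(key=lambda j: math.dist(p, points[j]))
        graph[i] = others[:k]
    return graph


# Hypothetical cells in a two-dimensional reduced space.
cells = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
         (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
graph = knn_graph(cells, k=2)
for cell, neighbours in graph.items():
    print(cell, "->", neighbours)
```

With k=2 the two groups of cells share no edges, which is the easy case where the graph falls apart into separate networks. The edges here are directed (cell i points to its neighbors); making them symmetric or weighting them by similarity is a further design choice.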
  • 06:37After you do this, basically
  • 06:40your map becomes a graph,
  • 06:44and a graph is a set of nodes,
  • 06:47where each node here is a
  • 06:50cell, with a set of connections.
  • 06:52These connections
  • 06:55can also be weighted, so
  • 06:57a weight can be assigned to each
  • 07:00connection depending on how similar
  • 07:02the two cells are. The method,
  • 07:05once you build the graph, is
  • 07:08to identify inside this graph, which
  • 07:10can also be seen as a network,
  • 07:13basically to identify communities,
  • 07:16so to identify inside this graph
  • 07:20communities of nodes, so clusters of
  • 07:22nodes that are highly interconnected
  • 07:25among themselves and with few
  • 07:29interconnections with other clusters.
  • 07:31Obviously if,
  • 07:32such as in this case,
  • 07:34you obtain not a single network
  • 07:37but different networks that are
  • 07:40completely separated, it's easier.
  • 07:43It's obviously easier to separate
  • 07:45these three clusters,
  • 07:46because in the graph they don't
  • 07:48share any connections, but sometimes the
  • 07:51graph-based approach
  • 07:53tries also to cut within
  • 07:56these networks in order to... yes?
  • 08:02Sorry, was there a question?
  • 08:06That was veins veins.
  • 08:07Did you have a question?
  • 08:11Sorry, I accidentally had my microphone on.
  • 08:15Sorry. And so these methods try to
  • 08:20divide a network into communities,
  • 08:23so they try to cut the network
  • 08:26in order to increase the density,
  • 08:29the density of the chunks
  • 08:33into which the network is divided.
  • 08:36Here I have a slide that
  • 08:39explains this better,
  • 08:41but the principle is that you build
  • 08:43the graph where each node is a cell
  • 08:45connected to its nearest neighbors,
  • 08:47and then you try to identify the
  • 08:49communities by creating cuts
  • 08:52inside the network, and the cuts have
  • 08:56to isolate parts of the network so
  • 09:00that you don't remove a lot of links
  • 09:03and you increase the density of the links
  • 09:06of what remains. There are
  • 09:09many approaches to do that. In
  • 09:12the single-cell pipelines,
  • 09:13especially in the most popular
  • 09:16methods, you will always see the Louvain
  • 09:19method for community detection.
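The idea of "dense inside, few links across" is usually made precise with a score called modularity, which is the quantity the Louvain method tries to maximize. The sketch below only computes that score for a given partition of a small made-up graph; it is not the Louvain search itself, and the function name and toy graph are assumptions.

```python
def modularity(edges, community):
    """Modularity of a partition of an unweighted, undirected graph.

    edges: list of (u, v) pairs; community: dict node -> community label.
    Q = sum over communities of (internal_edges / m) - (degree_sum / 2m)^2
    """
    m = len(edges)
    internal = {}  # edges with both ends in the same community
    degree = {}    # summed node degrees per community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (degree[c] / (2 * m)) ** 2
        for c in degree
    )


# Toy graph: two triangles of cells joined by a single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {node: "A" for node in range(6)}
q_split = modularity(edges, two_groups)
q_merged = modularity(edges, one_group)
print(q_split, q_merged)
```

Cutting the single bridging edge removes few links and leaves two dense chunks, so the split partition scores higher than putting every cell in one community.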
  • 09:24The advantage of this is that
  • 09:26many other methods for single-cell
  • 09:29analysis use the same approach,
  • 09:31so they first build this
  • 09:34graph and then they try to
  • 09:37identify the clusters as
  • 09:40communities of interconnected nodes.
  • 09:42And then we will see that also,
  • 09:44for example, many trajectory tools
  • 09:47will use this kind of graph in
  • 09:50order to build the trajectory.
  • 09:55So this is for the clustering.
  • 09:58Another important step,
  • 10:00once you identify the clusters, is to
  • 10:04identify the genes that
  • 10:08characterize these clusters.
  • 10:12That means that you want to identify the
  • 10:16so-called marker genes for each cluster.
  • 10:19These are ideally genes that
  • 10:21are expressed only
  • 10:23in a single cluster and not
  • 10:25in the other clusters.
  • 10:27And here you see an example.
  • 10:29So this is a
  • 10:31UMAP representation
  • 10:33of the first two dimensions of a single-cell
  • 10:37analysis of peripheral blood cells.
  • 10:41So you see that
  • 10:43Louvain clustering has been
  • 10:45applied and identified
  • 10:47eight clusters of cells, which
  • 10:50you see here numbered from 0 to 7, and the
  • 10:55identification of the marker genes is the
  • 10:58identification of genes that will be useful
  • 11:02in order to annotate the clusters.
  • 11:04Because, for example,
  • 11:06as you see here,
  • 11:07you can look at the
  • 11:09expression level of this gene.
  • 11:10You will also see this kind of representation
  • 11:14in papers or analyses of single
  • 11:16cells: you can represent the same map,
  • 11:19but instead of coloring the cells according
  • 11:21to the cluster, you color the cells
  • 11:24according to the expression of one gene.
  • 11:26So here cells are gray if the
  • 11:30gene MS4A1 is not expressed and
  • 11:33they are violet if the gene
  • 11:35is expressed a lot, and you see that
  • 11:37this is a good marker for cluster
  • 11:40three because you see that the marker
  • 11:42is highly expressed only in this
  • 11:44cluster and most of the other cells
  • 11:47do not express this gene at all.
  • 11:49So the task here is to
  • 11:52identify genes that are like this,
  • 11:55so genes that are highly expressed only
  • 11:57in one cluster and not in the others.
  • 11:59Basically this task is very similar
  • 12:01to a task of differential expression,
  • 12:03because what you want to do is to identify,
  • 12:07for each cluster,
  • 12:08genes that are differentially expressed,
  • 12:10in particular,
  • 12:11more expressed in the cluster
  • 12:13than in all the other cells.
  • 12:15So you divide your cells into two sets,
  • 12:20the cells belonging to the cluster
  • 12:21and all the other cells,
  • 12:23and then you try to identify genes
  • 12:25that are differentially expressed
  • 12:26between the two sets.
  • 12:29So the aim of this task is to identify
  • 12:31genes with different expression,
  • 12:33usually among clusters of cells,
  • 12:35and these are important
  • 12:37because they are cluster markers.
  • 12:39Now this is something that
  • 12:41we also covered in the second
  • 12:44lesson, where Everett spoke about
  • 12:46the task of differential expression
  • 12:49analysis with bulk RNA-seq.
  • 12:52The approaches
  • 12:54that are used in single cell are different,
  • 12:58and this is an area where there is no
  • 13:02consensus; everybody has their favorite method.
  • 13:04There are two papers that try to
  • 13:06do a benchmark and compare different
  • 13:08methodologies for differential expression,
  • 13:11but the result is that no method
  • 13:13is better than the others.
  • 13:15And the problem is that when you consider
  • 13:19the distribution
  • 13:21of the expression levels of
  • 13:23single genes in single-cell data,
  • 13:25they are very heterogeneous.
  • 13:27Here you see three examples
  • 13:30of the expression densities
  • 13:32across cells of three genes.
  • 13:35You see that this ambience
  • 13:3817 has this density,
  • 13:40which could be approximated
  • 13:43with a normal distribution,
  • 13:46more or less. This one has two peaks,
  • 13:51one being a peak of cells
  • 13:55where the gene is not expressed.
  • 13:58This is quite common in single cell
  • 14:01because of the dropout events.
  • 14:03If you remember,
  • 14:05that's one of the main problems of
  • 14:06single cell: in a lot of cells,
  • 14:09the absence of a gene is not
  • 14:11biological but technical,
  • 14:12so the gene simply
  • 14:13was not captured in the library preparation.
  • 14:16That's why for many genes
  • 14:18you have this situation:
  • 14:20in the same cluster of cells you
  • 14:23have some cells that express the gene
  • 14:25and some cells that do not express the gene.
  • 14:28And then in other cases,
  • 14:30for example, you have most of the
  • 14:32cells with zero expression.
  • 14:34So this means that there is no
  • 14:37unique distribution that allows
  • 14:39you to model
  • 14:41the expression of genes.
  • 14:43So that's why a
  • 14:46popular approach for
  • 14:48doing differential
  • 14:50expression with single-cell data
  • 14:52is to use nonparametric tests.
  • 14:55Nonparametric tests do not make
  • 14:57any assumption on the underlying
  • 14:59distribution of the expression.
  • 15:02For example,
  • 15:03probably the most used is the Wilcoxon
  • 15:06rank-sum test. A lot of these tests
  • 15:09do not
  • 15:10compare the real values,
  • 15:12but they compare ranks, so they
  • 15:15transform numbers into ranks
  • 15:16once you order the numbers from
  • 15:19the highest to the lowest.
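A rank-based comparison of one gene between a cluster and all other cells can be sketched as below. This is a simplified Wilcoxon rank-sum (Mann-Whitney) illustration, assuming ranks assigned from the lowest value upwards (the direction does not change the conclusion), average ranks for ties, and a normal approximation without tie correction. The expression values and function names are made up.

```python
import math


def midranks(values):
    """Rank values from lowest (rank 1) upwards; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def rank_sum_test(in_cluster, other_cells):
    """Return (U statistic, z score, two-sided p-value) for one gene."""
    n1, n2 = len(in_cluster), len(other_cells)
    ranks = midranks(list(in_cluster) + list(other_cells))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # ignores tie correction
    z = (u - mean_u) / sd_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u, z, p_two_sided


# Hypothetical expression of one gene: cells in the cluster vs. all the rest.
in_cluster = [5.0, 6.0, 7.0, 8.0]
other_cells = [0.0, 0.0, 0.0, 1.0]  # the zeros become tied ranks
u, z, p = rank_sum_test(in_cluster, other_cells)
print(u, z, p)
```

The three zeros in the second group all receive the same average rank, which is the tied-values situation that becomes problematic when most cells do not express the gene.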
  • 15:22So
  • 15:23they can be used with single cell, also
  • 15:25because
  • 15:26when you have a lot of cells you
  • 15:28have a lot of measurements, and
  • 15:30these kinds of tests work well
  • 15:33when you have a lot of replicates,
  • 15:35because you can consider each cell
  • 15:37as a replicate when you
  • 15:40try to establish the difference.
  • 15:43They are problematic, since they
  • 15:46work on ranks,
  • 15:47when you have a
  • 15:49lot of values that are the same.
  • 15:50These are tied values, and
  • 15:52that's exactly what happens
  • 15:54with the zeros,
  • 15:55so this could be a problem of
  • 15:57applying nonparametric tests when
  • 15:58you have a lot of cells where
  • 16:01the gene is not expressed at all.
  • 16:03Then the other methods are the same
  • 16:06as the ones used in bulk RNA-seq,
  • 16:09so these were the ones covered by Everett,
  • 16:14edgeR and DESeq2, and they are based on
  • 16:18modeling the gene expression with
  • 16:20a negative binomial distribution.
  • 16:22And then you have a lot of methods that
  • 16:24were developed for single cell,
  • 16:26and instead of the negative
  • 16:28binomial they use other distributions
  • 16:31that deal with the excess
  • 16:33of zeros that you have in
  • 16:35single-cell data sets.
  • 16:37So these are the three
  • 16:38main families that you
  • 16:39will find. There is no clear winner or
  • 16:42approach that is more used than the others.
  • 16:48So finding the marker genes,
  • 16:51as we said,
  • 16:52is very important because
  • 16:53it's necessary to
  • 16:56understand
  • 16:59the identity of each cell cluster.
  • 17:02And it's important to label
  • 17:05cell clusters with the cell types.
  • 17:07This is a problem that is
  • 17:10called cell type annotation.
  • 17:13The aim is that you want to annotate
  • 17:15clusters with known cell types,
  • 17:17depending on the system
  • 17:19that you are studying.
  • 17:21So
  • 17:22if you're speaking about,
  • 17:23for example,
  • 17:24peripheral blood, you want to
  • 17:26associate these clusters with the
  • 17:29known populations of blood cells
  • 17:31that you expect to find,
  • 17:34so T cells, B cells, and so on.
  • 17:38The main approaches are,
  • 17:41obviously, the manual approach: you look
  • 17:44at the marker genes and you know which
  • 17:46are the genes that should be highly
  • 17:49expressed in each cell population,
  • 17:51so you know which are the B cell markers,
  • 17:53the T cell markers, and you use your
  • 17:56personal knowledge to annotate the clusters.
  • 17:59This has been, in
  • 18:02past analyses of single cells,
  • 18:05the most used method.
  • 18:06And since it's manual and
  • 18:09based on personal knowledge, it is also
  • 18:11very time consuming, because you need
  • 18:14to review all the clusters and
  • 18:17annotate each cluster manually
  • 18:20based on your subjective knowledge.
  • 18:23There is a big development,
  • 18:25a huge development, of automatic
  • 18:28tools to perform this step,
  • 18:31so to perform cell type annotation.
  • 18:34And these automatic
  • 18:37procedures can be divided into two families.
  • 18:42There are some procedures that are
  • 18:45based on databases of marker genes.
  • 18:47What these
  • 18:50procedures do is that they
  • 18:52compare the list of marker genes
  • 18:55of each cluster with a database
  • 18:59of marker genes that were found
  • 19:03experimentally
  • 19:04in known populations of cells,
  • 19:07so the comparison is between
  • 19:09different lists of marker genes.
  • 19:11The advantage is that
  • 19:13you don't necessarily need another
  • 19:15single-cell data set as a reference.
  • 19:18You just need a list of genes,
  • 19:21and we will see that
  • 19:24there are databases that try to
  • 19:27cover all the marker genes
  • 19:29for cell populations,
  • 19:30at least in human and mouse.
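The list-against-list comparison can be sketched as follows. This is only an illustration, assuming a Jaccard overlap score, which is one of several ways such lists could be compared; the tiny marker table stands in for a real database, and the cluster marker lists and function name are made up.

```python
def annotate_by_markers(cluster_markers, marker_db):
    """Label each cluster with the cell type whose marker list overlaps most."""
    labels = {}
    for cluster, genes in cluster_markers.items():
        genes = set(genes)

        def jaccard(cell_type):
            reference = set(marker_db[cell_type])
            return len(genes & reference) / len(genes | reference)

        labels[cluster] = max(marker_db, key=jaccard)
    return labels


# Toy stand-in for a marker database (a real one covers many more genes).
marker_db = {
    "B cell": ["MS4A1", "CD79A", "CD79B"],
    "T cell": ["CD3D", "CD3E", "IL7R"],
    "Monocyte": ["CD14", "LYZ"],
}
# Hypothetical marker genes found for two clusters of the query data set.
cluster_markers = {
    "cluster 0": ["CD3D", "IL7R", "CCR7"],
    "cluster 3": ["MS4A1", "CD79A", "CD74"],
}
annotation = annotate_by_markers(cluster_markers, marker_db)
print(annotation)
```

No reference expression matrix is needed here: only gene lists are compared, which is the advantage mentioned above.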
  • 19:33Another family of approaches requires not
  • 19:36only a known list of marker genes,
  • 19:39but an annotated
  • 19:45single-cell RNA-seq experiment.
  • 19:48So the
  • 19:49strategy is what is represented here.
  • 19:52You have a query data set, that is
  • 19:54your data set, where you
  • 19:57have clusters but you don't have
  • 19:59labels, and then you have a reference
  • 20:01data set, so someone else already
  • 20:03performed their analysis of single
  • 20:06cells and labeled the clusters of cells.
  • 20:09The strategy is to try to
  • 20:12identify which of the clusters
  • 20:15of the query data set are more similar
  • 20:18to those of the reference, and this is
  • 20:21basically a problem of classification,
  • 20:23so they try to classify a data set
  • 20:25with unknown labels using a
  • 20:28single-cell data set with known labels.
  • 20:30And in this family
  • 20:32obviously you have many possible
  • 20:36methods to do this.
  • 20:38Some of the methods are based
  • 20:41on correlation: they try to calculate
  • 20:45the similarity through correlation measures.
  • 20:47Other approaches try to use supervised
  • 20:51classification methods that are
  • 20:53commonly used in machine learning.
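The correlation-based variant of this idea can be sketched as below: each unlabeled query cluster receives the label of the reference population whose average expression profile it correlates with best. The profiles, the gene order and the function names are made up for illustration, and Pearson correlation is an assumed choice.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles.
    Assumes neither profile is constant (non-zero variance)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def transfer_labels(query_profiles, reference_profiles):
    """Give each query cluster the label of its best-correlated reference."""
    return {
        cluster: max(
            reference_profiles,
            key=lambda label: pearson(profile, reference_profiles[label]),
        )
        for cluster, profile in query_profiles.items()
    }


# Hypothetical mean expression over the same four genes, in the same order:
# MS4A1, CD79A, CD3D, IL7R
reference_profiles = {
    "B cell": [5.0, 4.0, 0.1, 0.2],
    "T cell": [0.1, 0.0, 4.5, 3.8],
}
query_profiles = {
    "cluster 0": [0.2, 0.1, 3.9, 4.1],
    "cluster 3": [4.2, 3.5, 0.0, 0.3],
}
predicted = transfer_labels(query_profiles, reference_profiles)
print(predicted)
```

A supervised classifier trained on the labeled reference cells would play the same role as the correlation step here.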
  • 20:55So this is one of the fields where,
  • 20:58speaking about 2021, there are
  • 21:03huge developments and a lot of
  • 21:05tools that are being published
  • 21:09on bioRxiv or in journals right now.
  • 21:13And there is this metaphor that once
  • 21:15we have a lot of
  • 21:20data sets that are annotated,
  • 21:23this task will become
  • 21:25something like
  • 21:27mapping reads to a known genome.
  • 21:31So performing a single-cell analysis
  • 21:34with a new data set will become
  • 21:36as simple as that, because you have
  • 21:38a lot of reference populations, and
  • 21:40so it will be easier to annotate
  • 21:43your cells once you have a collection
  • 21:46of references that is reliable.
  • 21:58These are two resources, two databases that
  • 22:01collect cell type annotation markers,
  • 22:03and so they can be used to compare the
  • 22:08markers identified in your clusters with
  • 22:10a known collection of markers that were
  • 22:14identified based on single-cell data.
  • 22:17If you
  • 22:20look at flow cytometry,
  • 22:22the best markers are considered to be the
  • 22:26proteins that are expressed on the surface.
  • 22:29The problem is that the transcripts of
  • 22:32these surface markers are
  • 22:35not always among the top expressed genes,
  • 22:39and so they may be subject to dropout
  • 22:42events, and so the best collection of
  • 22:45cell markers for a transcriptomic study,
  • 22:48so based on gene expression, is different
  • 22:50from the best collection of markers based on
  • 22:54surface proteins.
  • 22:59And these two databases
  • 23:01collected single-cell signatures,
  • 23:03single-cell markers, in different
  • 23:06samples and tissues, mainly
  • 23:08for human and mouse.
  • 23:11Also, comparing the
  • 23:14two species is important to know which
  • 23:17markers are conserved across species
  • 23:19and which ones are species specific.
  • 23:26Then the last step for today
  • 23:31is the trajectory analysis.
  • 23:34While clustering tries to divide
  • 23:38the cells into discrete clusters,
  • 23:41the idea of trajectory analysis
  • 23:43is that
  • 23:47you are not capturing
  • 23:49cells in discrete states,
  • 23:51but you're capturing a sort of continuous
  • 23:54process, for example
  • 23:57differentiation.
  • 24:01These kinds of methods try to place
  • 24:05cells along a continuous path that
  • 24:09represents the evolution of a process.
  • 24:11This could be differentiation,
  • 24:13but another
  • 24:16simple example is the cell cycle.
  • 24:19So instead of dividing
  • 24:21cells into separate clusters,
  • 24:22you try to construct a sort of trajectory
  • 24:26that models this progression through,
  • 24:30for example, differentiation.
  • 24:32And this sort of
  • 24:35tools, trajectory inference
  • 24:37analysis, is also sometimes named
  • 24:40pseudotime analysis,
  • 24:43because pseudotime is basically
  • 24:47an abstract measure of
  • 24:50the progression through the process,
  • 24:52so from where the
  • 24:55process starts to where it ends.
  • 24:58The important assumption
  • 25:00in order to perform trajectory
  • 25:03analysis is that we are capturing
  • 25:05with our single-cell experiment
  • 25:08all the snapshots of the process
  • 25:10that we want to model, and this
  • 25:13means that we are also capturing
  • 25:16the intermediates, because all
  • 25:19the analysis depends on the
  • 25:21assumption that we have a continuum,
  • 25:23not discrete states,
  • 25:23and so we need to
  • 25:26capture some cells that represent
  • 25:29the transition between,
  • 25:30for example, two differentiation
  • 25:35states. So the assumption is
  • 25:38that we're capturing all the
  • 25:40snapshots and we don't have holes,
  • 25:42and so we have a lot of intermediates.
  • 25:44And the warning is that
  • 25:46these tools
  • 25:48will find a trajectory for
  • 25:51any data set that you use as input.
  • 25:54But this doesn't mean that the trajectory
  • 25:56that you find has any biological meaning.
  • 26:00There are a lot of methods
  • 26:02for this too; a
  • 26:05common and simple-to-explain
  • 26:08approach is represented here.
  • 26:12This is a PCA plot, and here,
  • 26:15instead of only two dimensions,
  • 26:17you see three dimensions,
  • 26:19so PC1, PC2, PC3.
  • 26:22Each dot is a cell.
  • 26:25What you do is you first perform
  • 26:27a clustering of cells with k-means
  • 26:30or with a graph-based approach;
  • 26:32this depends on the tool. And
  • 26:35so you identify these clusters of cells
  • 26:38inside your population. Now, if you remember,
  • 26:42each cluster can be associated with
  • 26:44a centroid, where the centroid is
  • 26:47the central point of the cluster,
  • 26:49so it is in the
  • 26:51mean position with respect to
  • 26:54all the elements of the cluster.
  • 26:56And these dots here represent the
  • 26:59centroids of the clusters that you identified.
  • 27:02Now what you do is you try to build a tree
  • 27:05that connects these centroids, and
  • 27:07to build this tree
  • 27:09there are many strategies.
  • 27:11One of the simplest is to
  • 27:14build the minimum spanning tree.
  • 27:17So you're trying to connect all these
  • 27:19points in a way that minimizes
  • 27:22the total length of
  • 27:25the branches.
  • 27:25If you have a set of points,
  • 27:28you can find a solution with a tree
  • 27:32that minimizes the sum of the lengths
  • 27:35of all the branches of your tree,
  • 27:37and this is called the minimum spanning tree.
  • 27:40The assumption is that the minimum
  • 27:42spanning tree is the correct tree that
  • 27:46models the trajectory in this data,
  • 27:49and this is not always the case,
  • 27:51so a warning is that
  • 27:53the minimum spanning tree is not
  • 27:55always the best solution.
  • 27:58Once you do this,
  • 27:59you have your tree, and some tools
  • 28:03try to assign a root to
  • 28:06this tree, or they may also give you
  • 28:09the possibility to select the
  • 28:12root, so the user can say that this
  • 28:15is the root of the tree,
  • 28:16because you know that these cells,
  • 28:19for example, are most similar to
  • 28:21what you expect from stem cells.
  • 28:23And so
  • 28:25once you define the root of the tree,
  • 28:28then you can
  • 28:29define and smooth your tree and you
  • 28:31can calculate the pseudotime for each cell,
  • 28:34where the pseudotime of the cell will be
  • 28:37the distance from the position of
  • 28:39the cell to the root of your tree
  • 28:42following the topology of the tree.
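The centroid-tree idea can be sketched in plain Python as below. This is a simplified illustration, not the algorithm of any particular tool: it builds a minimum spanning tree over made-up cluster centroids with Prim's algorithm and then measures distance from a user-chosen root along the tree. Pseudotime is computed here per centroid only; projecting individual cells onto the branches and smoothing the tree are left out.

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j) edges."""
    in_tree = {0}
    edges = []
    while len(in_tree) < len(points):
        # Cheapest edge that links a node in the tree to a node outside it.
        i, j = min(
            ((a, b) for a in in_tree
             for b in range(len(points)) if b not in in_tree),
            key=lambda e: math.dist(points[e[0]], points[e[1]]),
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges


def pseudotime(points, edges, root):
    """Distance from the root to every node, measured along the tree."""
    neighbours = {i: [] for i in range(len(points))}
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)
    time = {root: 0.0}
    stack = [root]
    while stack:
        node = stack.pop()
        for nxt in neighbours[node]:
            if nxt not in time:
                time[nxt] = time[node] + math.dist(points[node], points[nxt])
                stack.append(nxt)
    return time


# Hypothetical cluster centroids: a stem-like start (index 0) that
# progresses and then splits into two branches.
centroids = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (2.0, -1.0)]
tree = minimum_spanning_tree(centroids)
time_from_root = pseudotime(centroids, tree, root=0)
print(tree)
print(time_from_root)
```

Choosing a different root would give different pseudotime values on the same tree, which is why tools let the user pick the root from prior biological knowledge.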
  • 28:50Is this more or less clear?
  • 28:54Yes, to me. OK,
  • 28:57so many methods use this approach:
  • 29:00first dimensionality reduction,
  • 29:02then clustering, then the construction
  • 29:05of a tree. At the beginning
  • 29:09the tree was built on the single cells,
  • 29:12but that's unstable, and so
  • 29:14there was a switch from the single
  • 29:16cells to the centroids,
  • 29:17because the centroid is more stable:
  • 29:20if you have
  • 29:22deviations of the position,
  • 29:23the overall tree will remain the same.
  • 29:27That's why it's more stable. And
  • 29:29then, depending on the choice of the
  • 29:32dimensionality reduction approach
  • 29:33and on the tree that you build,
  • 29:36you have different outcomes and
  • 29:38different trajectories. This is
  • 29:41from a paper
  • 29:44that two years ago tried to compare
  • 29:50eight different methods to perform both
  • 29:53dimensionality reduction and
  • 29:58trajectory inference. So you see an example here.
  • 30:01Each row here is a different data set,
  • 30:04so you have five data sets, starting from
  • 30:07the simplest, because
  • 30:09it has the smallest size;
  • 30:11this is the number of cells.
  • 30:14And here you have an example where you
  • 30:17have almost a quarter of a million cells.
  • 30:20And with the same data set,
  • 30:22you try to represent the data
  • 30:24set with different approaches.
  • 30:26So this is principal component analysis,
  • 30:28this is t-SNE,
  • 30:29this is the UMAP approach, and these
  • 30:32others are methods like mono that try
  • 30:34also to perform trajectory inference.
  • 30:37And you see how differently the same data
  • 30:40set is represented
  • 30:43depending on the method that you use.
  • 30:47What you see here is also
  • 30:49that when the data set is bigger,
  • 30:51some of the methods do not even finish,
  • 30:54because the time required is too much,
  • 30:56and so you're not sure that some
  • 30:59iterative methods that require a lot
  • 31:02of steps converge, because you run out of,
  • 31:05for example,
  • 31:06the memory or the available time before
  • 31:08they complete the necessary number of steps.
  • 31:11And that's what you see here.
  • 31:12But the point here is that, as usual,
  • 31:15depending on the choice
  • 31:17of the tool,
  • 31:18you have a different representation
  • 31:20of the same data.
  • 31:24And there is no clear automatic
  • 31:27way to understand which is better.
  • 31:32These are also some guidelines
  • 31:34for selecting your best
  • 31:37trajectory analysis tool, based
  • 31:39also on the fact that some tools
  • 31:43make assumptions on the trajectory.
  • 31:46Some tools, for example, only try to
  • 31:49model linear trajectories without branches.
  • 31:53Some tools allow branches,
  • 31:55but only bifurcations,
  • 31:57so you can only have two choices
  • 31:59when you have a decision.
  • 32:01Some tools also allow
  • 32:03multifurcation, so from one,
  • 32:06let's say, crossroads
  • 32:09you have multiple possible roads,
  • 32:11and some of the tools also allow you to have
  • 32:14cycles inside your trajectories.
  • 32:17So depending on your assumptions,
  • 32:20these are the different families
  • 32:23of methods that you can use
  • 32:26and that are available online.
  • 32:28This is all based on, again, a review
  • 32:31of trajectory analysis choices.
  • 32:33It's quite recent,
  • 32:34two years ago,
  • 32:35so it contains
  • 32:38for sure the most popular methods
  • 32:41that are also used now.
  • 32:44One exception that is not here,
  • 32:48and it's
  • 32:49a method that had a lot of
  • 32:52popularity and
  • 32:54offers some unique insight,
  • 32:57is called RNA velocity, and this is
  • 33:02the last thing I will speak about today.
  • 33:05The paper was published three years ago,
  • 33:09and it's a method that
  • 33:13analyzes single cells
  • 33:15using a biological insight
  • 33:19that concerns splicing.
  • 33:23You know that in human and in
  • 33:26eukaryotic cells,
  • 33:29when the RNA is transcribed,
  • 33:30it has to be processed.
  • 33:32One of the main steps of the processing
  • 33:34is splicing, which removes
  • 33:37the introns from your gene,
  • 33:39and once the gene is spliced,
  • 33:40it is exported, and so on.
  • 33:42So the basic principle of RNA
  • 33:44velocity is that when you perform
  • 33:46single-cell RNA-seq,
  • 33:48you have some reads that are
  • 33:50captured from unspliced RNA and
  • 33:53some reads that are captured
  • 33:55from spliced
  • 33:56RNA, and you can distinguish between
  • 34:00unspliced reads and spliced reads
  • 34:02when you align the reads
  • 34:05to your reference genome, because
  • 34:07spliced reads
  • 34:10will not contain introns,
  • 34:12basically, while unspliced reads
  • 34:14will contain partial
  • 34:16or complete intron sequences.
  • 34:18So you can divide the reads into
  • 34:20spliced and unspliced.
  • 34:23So that's the basic assumption.
  • 34:25And the second assumption is that you
  • 34:29can calculate the ratio for each gene.
  • 34:31The ratio between unspliced
  • 34:34and spliced reads.
  • 34:35And the assumption is that if you
  • 34:39have a lot of unspliced reads,
  • 34:41that means that the gene has a
  • 34:44high transcription at the moment,
  • 34:47and so it means that in the future
  • 34:50probably that gene will be more abundant
  • 34:53in the spliced state,
  • 34:55because if you are capturing a cell at
  • 34:57time zero and you're capturing
  • 35:00a lot of unspliced RNA, in the future
  • 35:03that RNA will be spliced and so
  • 35:05the spliced RNA will increase.
  • 35:08On the opposite,
  • 35:09if you have a lot of spliced
  • 35:12RNA and not unspliced RNA,
  • 35:15you can predict that transcription
  • 35:18at the moment is shut off and
  • 35:21the spliced RNA in the future
  • 35:25will be reduced because of degradation.
  • 35:28So, based on the ratio between unspliced
  • 35:30and spliced: if you have a high
  • 35:32proportion of unspliced reads,
  • 35:35you predict that in the future
  • 35:37the gene will be more expressed.
  • 35:39If you have no unspliced reads,
  • 35:41you predict that the gene will
  • 35:44be less expressed in the future.
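The reasoning above can be sketched numerically for a single gene. This is a simplified illustration under stated assumptions, not the exact procedure of the RNA velocity paper: it assumes a steady-state model in which unspliced counts are proportional to spliced counts (`u = gamma * s`), fits `gamma` by least squares through the origin across all cells, and calls the residual `u - gamma * s` the velocity. The function name `steady_state_velocity` and the toy counts are hypothetical.

```python
import numpy as np


def steady_state_velocity(unspliced, spliced):
    """Per-cell velocity of one gene under an assumed steady-state model.

    unspliced, spliced: counts of the gene in each cell (same length).
    At least one cell must have a nonzero spliced count.
    Returns (gamma, velocity), with velocity = u - gamma * s per cell.
    """
    u = np.asarray(unspliced, dtype=float)
    s = np.asarray(spliced, dtype=float)
    # Least-squares slope of u against s, with the line forced through zero.
    gamma = float(np.dot(u, s) / np.dot(s, s))
    # More unspliced RNA than the steady state predicts -> positive velocity
    # (expression expected to rise); less -> negative (expected to fall).
    velocity = u - gamma * s
    return gamma, velocity


# Toy data: the last cell has an excess of unspliced reads.
gamma, v = steady_state_velocity([1, 2, 3, 8], [2, 4, 6, 4])
print(gamma, v)
```

In this toy example only the last cell gets a positive velocity, which matches the intuition from the talk: a high proportion of unspliced reads predicts that the gene will be more expressed in the future.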
  • 35:47And so if you're measuring
  • 35:49the present state of the cell using
  • 35:53this concept, you can predict where
  • 35:55the cell will be in the future,
  • 35:58and so you can infer the position
  • 36:00of the cell in the future,
  • 36:03and you can connect the observed
  • 36:05cell with the predicted cell,
  • 36:07and that's the principle of RNA velocity.
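Connecting the observed cell with its predicted future state can be sketched as a simple extrapolation. This is a minimal sketch assuming that the velocity stays constant over a short time step `dt`; the function names `predict_future_state` and `arrow`, and the choice of `dt`, are hypothetical and only serve the illustration.

```python
import numpy as np


def predict_future_state(spliced, velocity, dt=1.0):
    """Extrapolate a cell's spliced expression vector a short time ahead.

    spliced:  current spliced expression of the cell, one value per gene.
    velocity: estimated velocity of each gene in that cell.
    dt:       assumed time step over which velocity is held constant.
    """
    s = np.asarray(spliced, dtype=float)
    v = np.asarray(velocity, dtype=float)
    # Expression cannot be negative, so clip the extrapolation at zero.
    return np.maximum(s + dt * v, 0.0)


def arrow(current, future):
    """Displacement from the observed state to the predicted state."""
    return np.asarray(future, dtype=float) - np.asarray(current, dtype=float)


current = np.array([5.0, 2.0])       # two genes in one cell
future = predict_future_state(current, velocity=[1.0, -3.0])
print(future, arrow(current, future))
```

Here the arrow lives in gene expression space; to draw it on a plot, it would still have to be projected into the same low-dimensional space as the cells.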
  • 36:09So the plots that you see with RNA
  • 36:12velocity are these, where each cell
  • 36:15is linked
  • 36:17to an arrow, and the
  • 36:20arrow basically connects the cell
  • 36:22with the prediction of where the cell
  • 36:25will be in the future, based on the
  • 36:28ratio of unspliced versus spliced
  • 36:31RNA that you capture in this cell.
  • 36:34So this can
  • 36:37also be seen as a trajectory
  • 36:39prediction tool,
  • 36:40because once you see the map of all
  • 36:43the arrows you can visually see where
  • 36:45the cells are going and
  • 36:48you can sort of infer a trajectory.
  • 36:50And the advantage of this method is that,
  • 36:52for example, you don't have to
  • 36:54select the root: you know from the
  • 36:56arrows where the root is.
  • 36:58So these arrows have a directionality,
  • 37:01so, differently from these approaches,
  • 37:06you can not only have a trajectory but
  • 37:08also have a trajectory with a direction.
  • 37:14The problem that arises, obviously, is
  • 37:17that all of this is based on the idea
  • 37:20that you capture a lot of unspliced RNA,
  • 37:24and this could be problematic,
  • 37:25especially with certain libraries.
  • 37:27For example, we have seen that 3'-end-enriched
  • 37:30or 5'-end-enriched
  • 37:33libraries do not cover the whole
  • 37:35gene, and so you can have
  • 37:37some biases. In the original paper,
  • 37:39they show that this method
  • 37:41works also with the 10x data,
  • 37:44which are 3'-based,
  • 37:46at least for the data set that they show.
  • 37:50And this is part of
  • 37:52the original paper.
  • 37:53And, as always, the cells
  • 37:56are visualized in a
  • 38:00dimensionality reduction space.
  • 38:02For example, here you see again principal
  • 38:05component 1 and principal component 2.
  • 38:08Can I ask two questions? Yeah.
  • 38:11How would, for example, a splicing
  • 38:15gene mutation affect this?
  • 38:18Hey you guys, your lab does do that.
  • 38:21Yeah, this is interesting.
  • 38:23I don't recall that anybody...
  • 38:27This is interesting, obviously,
  • 38:29because you expect a mutation in a splicing
  • 38:34factor to alter the trajectory.
  • 38:38I don't know if anybody did this.
  • 38:41I have to say I didn't
  • 38:44run an exhaustive search,
  • 38:47so it may be that on bioRxiv there is
  • 38:49something about this, but not that
  • 38:51I'm aware of.
  • 38:54OK, so the other question is how many,
  • 38:57roughly, in proportion,
  • 38:59how many genes can be seen or can
  • 39:02be used to project these maps?
  • 39:05Because I'm imagining only the
  • 39:07highly expressed genes that normally
  • 39:09contain introns can be used here,
  • 39:11which shouldn't be that many, right?
  • 39:14Yeah, yes, so that's true.
  • 39:15So here again in the paper they compare.
  • 39:18So this is SMART-seq2,
  • 39:20the technology that is
  • 39:22ideal because you have a lot of reads
  • 39:24for each cell and you have the full
  • 39:26coverage, and they capture 22% of
  • 39:29reads that are unspliced.
  • 39:33This is with the Chromium, so this is 10x.
  • 39:37So yeah.
  • 39:38So it seems that here the ratio is the same,
  • 39:40but obviously they will be mostly
  • 39:43from the last intron. And so, yeah,
  • 39:46the fact that you capture more of
  • 39:49a selection of genes that
  • 39:52are highly expressed,
  • 39:53that's the underlying
  • 39:56bias of all the analyses
  • 39:57at the single-cell level,
  • 39:59and I assume that this is amplified
  • 40:02in this sort of analysis.
  • 40:05So,
  • 40:06but I cannot give you a number
  • 40:09of genes because I never used it,
  • 40:13and so I don't have
  • 40:16hands-on experience with this.
  • 40:18OK.
  • 40:21But yes, for sure a limitation is on the
  • 40:23number of genes and
  • 40:27also on the length
  • 40:30of the read and how much coverage you
  • 40:34have from the poly-A tail, for example.
  • 40:37I think that shorter genes,
  • 40:39for example genes with few exons
  • 40:41and few introns that are shorter,
  • 40:43will also be more captured,
  • 40:47more present in this analysis
  • 40:50than long genes with long introns. Yeah.
  • 40:55OK, my last slide is about
  • 40:58this collection of resources
  • 41:00about single-cell sequencing analysis.
  • 41:03The website is called
  • 41:05scRNA-tools. And so this is
  • 41:09the trend of the number of
  • 41:12tools that you can find in this
  • 41:14collection.
  • 41:15So right now there are over 1000
  • 41:18computational tools for the analysis
  • 41:21of single-cell RNA-seq.
  • 41:24And here you see the stats on the platform,
  • 41:29so on the languages that are
  • 41:31mainly used by these tools.
  • 41:33Most of them right now are
  • 41:35in R, but obviously
  • 41:37almost every tool is either
  • 41:39in R or Python.
  • 41:41Then you have C++.
  • 41:43Probably this covers some
  • 41:45tools whose tasks
  • 41:48need to be performed
  • 41:50with efficiency
  • 41:52from the computational point of view.
  • 41:55And then you have some tools in
  • 41:58MATLAB and others. And here what
  • 42:01you see is a division of these resources
  • 42:05into categories.
  • 42:07Some tools cover the full pipeline,
  • 42:11at least from the digital gene
  • 42:14expression, so once you have the
  • 42:16gene expression, through all these steps
  • 42:18of the analysis, so dimension reduction,
  • 42:20clustering and so on,
  • 42:22and some tools are more specific.
  • 42:24So if you look at the frequency,
  • 42:26most of the tools are about
  • 42:29visualization of single-cell data:
  • 42:3140% are about visualization.
  • 42:33Then in second position you have clustering
  • 42:36and dimensionality reduction.
  • 42:38I didn't speak about this,
  • 42:40but also very important is the
  • 42:43integration of different data sets.
  • 42:45So this means the integration
  • 42:47of different single-cell
  • 42:49RNA-seq
  • 42:49experiments, and also integration
  • 42:51of multiple modalities.
  • 42:53So, for example, there
  • 42:55are a lot of techniques now that
  • 42:58enable you to capture, for example,
  • 43:01the RNA levels and also the
  • 43:03chromatin
  • 43:06open versus closed state
  • 43:09in the same cell,
  • 43:11and so these tools are about how to
  • 43:15integrate these multiple sources of
  • 43:17information and multiple datasets.
  • 43:20And then you have ordering, gene networks,
  • 43:22differential expression,
  • 43:22so a lot of topics that
  • 43:26we covered in this last
  • 43:30session. So if you go and look,
  • 43:34you find the tool,
  • 43:35the platform,
  • 43:39then the number of citations.
  • 43:40For example, you can see
  • 43:42which tools are more popular
  • 43:43with respect to others,
  • 43:45if you have to make a choice. And
  • 43:48the advantage is that it is quite
  • 43:51comprehensive and updated weekly.
  • 44:01So.
  • 44:05These eyes.