Analysis and Interpretation of single cells sequencing data – part 3 Key Analysis Steps

August 25, 2021

ID6878

00:05So today we will finish our
00:08session on single-cell data analysis.
00:12So far we have arrived at this step in
00:16the analysis. After all the quality
00:20control and the normalization,
00:22a necessary step to reduce the complexity
00:25of the data set is the reduction
00:29of dimensionality,
00:31so the reduction of features.
00:33There are two possible ways:
00:35feature selection, which extracts only
00:37the relevant genes, and methods
00:40for dimensionality reduction.
00:42So last time we saw
00:44the principal component analysis and
00:46other tools that are used for single cell,
00:49especially for visualization:
00:52the t-SNE method and the UMAP method.
00:55They are both nonlinear and graph based.
00:59So today we will briefly see
01:01the remaining steps,
01:03the downstream steps of the analysis,
01:05and we will start from the clustering
01:13of single-cell data.
01:14So we have our cells
01:17mapped in our low-dimensional space
01:20and we want to identify clusters,
01:24meaning cells that have a similar
01:26expression signature, so that they
01:29are very similar to each other.
01:31Since this problem of clustering
01:33is quite general and it was also
01:36covered by June in the first
01:38lesson of this didactic section,
01:40many of the methods that
01:43can be used for single-cell data are
01:45the same that he covered last time.
01:48So what you see here
01:50is an example of a data set
01:52of cells,
01:54and I think they are divided because
01:58of different developmental stages.
02:02So these clusters that you
02:05see here are known
02:08because of the experiment:
02:11these cells were isolated at
02:14different differentiation
02:15states, and they are already
02:18mapped in a low-dimensional space.
02:20So what you see here are
02:23the first two principal components.
02:26Usually clustering is
02:28calculated on principal components
02:30or on a reduced dimensional space.
02:33This doesn't mean that you
02:34only take the first two.
02:36You have to select the
02:38first 10, 20, or 30.
02:39There are methods to select
02:44a certain number of dimensions
02:46so that you keep the ones that
02:49carry most of the information and you
02:51remove the lower dimensions that
02:53are associated with less variance.
02:56The assumption is that
02:59they mostly capture noise.
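One simple way to pick the number of components, sketched here under assumptions: keep the leading components until a chosen fraction of the total variance is reached. The function name, the 90% threshold, and the variance values are hypothetical and only illustrate the idea; the lecture does not say which selection method is used.

```python
def n_components_for_variance(variances, target=0.9):
    """Return how many leading components reach `target` of the total variance.

    `variances` must be sorted from largest to smallest, as PCA returns them.
    """
    total = sum(variances)
    running = 0.0
    for i, v in enumerate(variances, start=1):
        running += v
        if running / total >= target:
            return i
    return len(variances)


# Made-up variances for ten principal components (illustration only).
example_variances = [40, 25, 12, 8, 5, 4, 3, 1.5, 1, 0.5]
n_keep = n_components_for_variance(example_variances, target=0.9)
```

The components after `n_keep` are the low-variance ones that are assumed to capture mostly noise.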
03:01But the example
03:03here is a simplified example,
03:05so you see only PC1 and PC2.
03:08So in order to cluster, you have
03:11the two approaches that June
03:13covered in the clustering lesson. The first
03:16was hierarchical clustering.
03:18These methods try to
03:20progressively connect
03:21cells that are similar to each other.
03:26Also in this case, if you remember,
03:28there is the concept of distance.
03:30All clustering methods are
03:32based on the fact that you have
03:34to calculate a sort of distance
03:36or similarity measure between
03:38pairs of cells.
03:39So for example,
03:40here you can measure the distance
03:43as the Euclidean distance in
03:45this principal component space.
03:46Otherwise you can use other
03:48measures of distance,
03:50for example correlation, and so on.
03:53The choice of the distance
03:59will change the clustering results.
04:00So this still holds true.
04:02Hierarchical clustering tries
04:04to progressively connect similar
04:06entities from the bottom up until arriving
04:11at unifying everything.
04:12Then the problem of hierarchical
04:15clustering is to decide where to cut the tree.
04:18Depending on where you cut the tree,
04:20you can separate, for example, if you cut
04:22this tree here, you separate
04:24two clusters; here,
04:25you separate three, and so on.
04:27In this example,
04:28you know that there are 1, 2, 3, 4, 5
04:31main clusters,
04:33but this information is not always obvious.
04:37So in clustering, the
04:39optimal number
04:41of clusters that represents your
04:43data is always a tricky choice
04:46and ultimately subjective.
04:49The second approach that Jonah
04:51explained in the lesson
04:53about clustering is k-means.
04:55K-means clustering is based on the fact that
04:59you select a priori, before beginning,
05:01a number of clusters, that is k, into
05:04which you want to divide your data,
05:07and then you apply a sort of
05:11iterative procedure that is based
05:14on the definition of a centroid.
05:16The centroid is the
05:18average point of a cluster,
05:20so it's not a real point,
05:22but it's a point that represents
05:24the average of all the points
05:26that belong to the cluster.
05:28The procedure is to iteratively
05:31assign each cell to the nearest
05:34centroid until you reach convergence,
05:38so until,
05:41in consecutive iterations,
05:42for example, you don't have any
05:45change of labels for the cells.
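The iterative procedure described here can be sketched in a few lines. This is a minimal illustration, not the implementation used by any single-cell package: the toy 2D points stand in for cells in principal component space, and the random initialization and the stopping rule (labels unchanged between two iterations) are assumptions.

```python
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Plain k-means: assign each point to the nearest centroid, then
    recompute each centroid as the mean of its members, until labels stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:  # convergence: no label changed
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


# Toy "cells" in a 2D space (made-up coordinates).
cells = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(cells, k=2)
```

Note that the centroids are averages, so in general they do not coincide with any real cell.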
05:47A family of clustering
05:50methods that is widely used
05:52in single-cell approaches is
05:55the graph-based family.
05:57This is something that John
05:59didn't talk about. The principle
06:02here is to build a graph
06:05on this space, and usually
06:08it is a so-called k-
06:12nearest-neighbor graph.
06:13That means that for every cell
06:16you draw a line that connects the
06:19cell with its k nearest cells,
06:21and that's the k parameter.
06:24So for example, in this example here,
06:26this is a 10-nearest-neighbor graph,
06:31so it means that each cell
06:33here is connected to its
06:3510 nearest cells.
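As a rough sketch of the construction just described, the following builds a k-nearest-neighbor graph with a brute-force search. The Euclidean distance and the toy coordinates are assumptions; real pipelines typically use approximate neighbor search on many principal components.

```python
import math


def knn_graph(points, k):
    """Return {cell index: [indices of its k nearest cells]} using Euclidean distance."""
    graph = {}
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))
        graph[i] = others[:k]
    return graph


# Toy "cells" in a 2D space (made-up coordinates).
toy_cells = [(0, 0), (0, 1), (5, 5), (5, 6)]
toy_graph = knn_graph(toy_cells, k=1)
```

With `k=1` each toy cell is linked only to its closest neighbor, so the graph falls apart into two disconnected pairs.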
06:37After you do this, basically
06:40your map becomes a graph.
06:44A graph is a set of nodes,
06:47and each node here is a
06:50cell, with a set of connections.
06:52These connections
06:55can also be weighted, so
06:57a weight can be assigned to each
07:00connection depending on how similar
07:02the two cells are. The method,
07:05once you build the graph, is
07:08to look inside this graph, which
07:10can also be seen as a network,
07:13and basically identify communities,
07:16so to identify inside this graph
07:20communities of nodes, that is, clusters of
07:22nodes that are highly interconnected
07:25among themselves and with few
07:29interconnections with other clusters.
07:31So obviously if,
07:32such as in this case,
07:34you obtain not a single network
07:37but different networks that are
07:40completely separated, it's easier.
07:43It's obviously easier to separate
07:45these three clusters,
07:46because in the graph they don't
07:48share any connections. But sometimes the
07:51graph-based approach
07:53also tries to cut within
07:56these networks in order to... yes?
08:02Sorry, was there a question?
08:06That was veins veins.
08:07Did you have a question?
08:11Sorry, I accidentally had my microphone on.
08:15Sorry. And so these methods try to
08:20divide a network into communities,
08:23so they try to cut the network
08:26in order to increase the density,
08:29to increase the density of the chunks
08:33into which the network is divided.
08:36Here I have a slide that
08:39explains this better,
08:41but the principle is that you build
08:43the graph where each node is a cell
08:45connected to its nearest neighbors,
08:47and then you try to identify the
08:49communities by creating cuts
08:52inside the network. The cuts have
08:56to isolate parts of the network so
09:00that you don't remove a lot of links
09:03and you increase the density of the links
09:06of what remains. There are
09:09many approaches to do that. In
09:12the single-cell pipelines,
09:13especially in the most popular
09:16ones, you will always see the Louvain
09:19method for community detection.
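The Louvain method searches for the partition that maximizes a score called modularity, which is high when many links fall inside communities and few between them. The sketch below only computes that score for a given partition on an unweighted graph; it is not the Louvain search itself, and the toy graph (two triangles joined by one edge) is made up for illustration.

```python
def modularity(edges, community):
    """Modularity of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs; community: {node: community label}.
    Q = sum over communities of (fraction of edges inside)
        minus (fraction of edge ends in the community) squared.
    """
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Observed fraction of edges whose two ends are in the same community.
    inside = sum(1 for u, v in edges if community[u] == community[v]) / m
    # Expected fraction if edges were placed at random with the same degrees.
    degree_per_community = {}
    for node, d in degree.items():
        c = community[node]
        degree_per_community[c] = degree_per_community.get(c, 0) + d
    expected = sum((t / (2 * m)) ** 2 for t in degree_per_community.values())
    return inside - expected


# Two triangles (0-1-2 and 3-4-5) joined by a single edge 2-3.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_communities = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_community = {n: "a" for n in range(6)}
q_split = modularity(toy_edges, two_communities)
q_merged = modularity(toy_edges, one_community)
```

Cutting the single bridging edge removes few links and leaves two dense chunks, so the split partition scores higher than keeping everything together.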
09:24The advantage of this is that
09:26many other methods for single-cell
09:29analysis use the same approach,
09:31so they first build this
09:34graph and then they try to
09:37identify the clusters as
09:40communities of interconnected cells.
09:42And we will see that also,
09:44for example, many trajectory tools
09:47use this kind of graph in
09:50order to build the trajectory.
09:55So this is for the clustering.
09:58Another important step,
10:00once you identify the clusters, is to
10:04identify the genes that
10:08characterize these clusters.
10:12That means that you want to identify the
10:16so-called marker genes for each cluster.
10:19These are ideally genes that
10:21are expressed only
10:23in a single cluster and not
10:25in the other clusters.
10:27Here you see an example.
10:29This is a
10:31UMAP representation
10:33of the first two dimensions of a single-cell
10:37analysis of peripheral blood cells.
10:41You see that
10:43Louvain clustering has been
10:45applied and has identified
10:47eight clusters of cells, which
10:50you see here numbered from 0 to 7. The
10:55identification of the marker genes is the
10:58identification of genes that will be useful
11:02in order to annotate the clusters.
11:04For example,
11:06as you see here,
11:07you can look at the
11:09expression level of this gene.
11:10You often see this kind of representation
11:14in papers or analyses of single
11:16cells: you can represent the same map,
11:19but instead of coloring the cells according
11:21to the cluster you color the cells
11:24according to the expression of one gene.
11:26So here cells are gray if the
11:30gene MS4A1 is not expressed and
11:33they are violet if the gene
11:35is expressed a lot. You see that
11:37this is a good marker for cluster
11:403 because the marker
11:42is highly expressed only in this
11:44cluster and most of the other cells
11:47do not express this gene at all.
11:49So the task here is to
11:52identify genes that are like this,
11:55genes that are highly expressed only
11:57in one cluster and not in the others.
11:59Basically this task is very similar
12:01to a task of differential expression,
12:03because what you want to do is to identify,
12:07for each cluster,
12:08genes that are differentially expressed,
12:10in particular
12:11more expressed in the cluster
12:13than in all the other cells.
12:15So you divide your cells into two sets,
12:20the cells belonging to the cluster
12:21and all the other cells,
12:23and then you try to identify genes
12:25that are differentially expressed
12:26between the two sets.
12:29So the aim of this task is to identify
12:31genes with different expression,
12:33usually among clusters of cells,
12:35and these are important
12:37because they are cluster markers.
12:39This is something that
12:41we also covered: in the second
12:44lesson Everett spoke about
12:46this task, differential expression
12:49analysis, with bulk RNA-seq.
12:52The approaches
12:54that are used in single cell are different,
12:58and this is an area where there is no
13:02single favorite method.
13:04There are two papers that try to
13:06do a benchmark and compare different
13:08methodologies to do differential expression,
13:11but the result is that no method
13:13is better than the others.
13:15The problem is that when you consider
13:19the distribution
13:21of the expression levels of
13:23single genes in single-cell data,
13:25they're very heterogeneous.
13:27Here you see three examples
13:30of the expression densities
13:32across cells of three genes.
13:35You see that one gene
13:38has this density,
13:40which could be approximated
13:43with something like a normal distribution.
13:46Another one has two peaks,
13:51one of which is a peak of cells
13:55where the gene is not expressed.
13:58This is quite common in single cell
14:01because of the dropout events.
14:03If you remember,
14:05that's one of the main problems of
14:06single cell: in a lot of cells
14:09the absence of a gene is not
14:11biological but technical,
14:12so the gene simply
14:13was not captured in the library preparation.
14:16That's why for many genes
14:18you have this situation:
14:20in the same cluster of cells you
14:23have some cells that express the gene
14:25and some cells that do not express the gene.
14:28And then in other cases,
14:30for example, you have most of the
14:32cells with zero expression.
14:34So this means that there is no
14:37unique distribution that allows
14:39you to model
14:41the expression of genes.
14:43That's why a
14:46popular approach for
14:48doing differential
14:50expression with single-cell data
14:52is to use nonparametric tests.
14:55Nonparametric tests do not make
14:57any assumption on the underlying
14:59distribution of the expression.
15:02For example,
15:03probably the most used is the Wilcoxon rank-
15:06sum test. A lot of these tests
15:09do not
15:10compare the real values,
15:12but they compare ranks, so they
15:15transform numbers into ranks
15:16once you order the numbers from
15:19the highest to the lowest.
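To make the rank transformation concrete, here is a small sketch of the statistic behind the Wilcoxon rank-sum test: pool the expression values of one gene from the cluster and from all the other cells, replace them with ranks, and sum the ranks of the cluster. Ranking from lowest to highest and giving tied values their average rank are conventions assumed here, and the expression values are made up. The sketch stops at the statistic and does not compute a p-value.

```python
def midranks(values):
    """Rank values from lowest (rank 1) to highest; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = average_rank
        i = j + 1
    return ranks


def rank_sum(in_cluster, other_cells):
    """Sum of the pooled ranks that belong to the cells of the cluster."""
    ranks = midranks(list(in_cluster) + list(other_cells))
    return sum(ranks[:len(in_cluster)])


# Made-up expression of one gene: high in the cluster, mostly zero elsewhere.
cluster_expr = [5, 7, 9]
other_expr = [0, 0, 0, 1]
w = rank_sum(cluster_expr, other_expr)
```

In this toy case the three cluster cells take the three highest ranks, and the three zeros in the other cells all share the same rank.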
15:23They can be used with single cell
15:25because,
15:26when you have a lot of cells, you
15:28have a lot of measurements, and
15:30these kinds of tests work well
15:33when you have a lot of replicates,
15:35because you can consider each cell
15:37as a replicate when you
15:40try to establish the difference.
15:43They are problematic, since they
15:46work on ranks,
15:47when you have a
15:49lot of values that are the same.
15:50These are tied values, and
15:52that's exactly what happens
15:54with the zeros.
15:55So this could be a problem of
15:57applying nonparametric tests when
15:58you have a lot of cells where
16:01the gene is not expressed at all.
16:03Then the other methods are the same
16:06as the ones used in bulk RNA-seq,
16:09so these were the ones covered by Everett,
16:14edgeR and DESeq2, and they are based on
16:18modeling the gene expression with
16:20a negative binomial distribution.
16:22And then you have a lot of methods that
16:24were developed for single cell,
16:26and instead of the negative
16:28binomial they use other distributions
16:31that deal with the excess
16:33of zeros that you have in single-
16:35cell data sets.
16:37So these are the three
16:38main families that you
16:39will find. There is no clear winner or
16:42approach that is more used than the others.
16:48So finding the marker genes,
16:51we said,
16:52is very important because
16:53it's necessary to understand
16:59the identity of each cell cluster.
17:02And it's important to label
17:05cell clusters with the cell types.
17:07This is a problem that is
17:10called cell type annotation.
17:13The aim is that you want to annotate
17:15clusters with known cell types,
17:17depending on the system
17:19that you are studying.
17:22If you're speaking about,
17:23for example,
17:24peripheral blood, you want to
17:26associate these clusters with
17:29known populations of blood cells that
17:31you expect to find,
17:34so T cells, B cells, and so on.
17:38The main approaches are,
17:41obviously, the manual approach: you look
17:44at the marker genes and you know which
17:46are the genes that should be highly
17:49expressed in each cell population,
17:51so you know which are the B cell markers,
17:53the T cell markers, and you use your
17:56personal knowledge to annotate the cluster.
17:59This has probably been, in
18:02past analyses of single cells,
18:05the most used method.
18:06Since it's manual and
18:09based on personal knowledge, it is also
18:11very time consuming, because you need
18:14to review all the clusters and
18:17annotate each cluster manually
18:20based on your subjective knowledge.
18:23There is a big development,
18:25a huge development, of automatic
18:28tools to perform this step,
18:31so to perform cell type annotation.
18:34These automatic
18:37procedures can be divided into two families.
18:42There are some procedures that are
18:45based on databases of marker genes.
18:47What these
18:50procedures do is that they
18:52compare the list of marker genes
18:55of each cluster with a database
18:59of marker genes that were found
19:03experimentally
19:04in known populations of cells,
19:07so the comparison is between
19:09different lists of marker genes.
19:11The advantage is that
19:13you don't necessarily need another
19:15single-cell data set as a reference.
19:18You just need a list of genes,
19:21and we will see that there
19:24are databases that try to
19:27cover all the marker genes
19:29for cell populations,
19:30at least in human and mouse.
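A comparison between marker lists can be as simple as an overlap score. The sketch below uses the Jaccard index, which is one possible choice and not necessarily what any published tool does. The small reference dictionary is a hypothetical stand-in for a marker database, filled with a few well-known B cell and T cell genes for illustration only.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union of two gene lists."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_matching_cell_type(cluster_markers, reference_markers):
    """Score the cluster markers against every reference list and return the best label."""
    scores = {
        cell_type: jaccard(cluster_markers, genes)
        for cell_type, genes in reference_markers.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical mini "database" of marker genes (illustration only).
toy_reference = {
    "B cell": ["MS4A1", "CD79A"],
    "T cell": ["CD3D", "CD3E"],
}
# Made-up marker list for one cluster of the query data set.
cluster_3_markers = ["MS4A1", "CD79A", "CD74"]
label, scores = best_matching_cell_type(cluster_3_markers, toy_reference)
```

Only gene lists are needed here, which is the advantage mentioned above: no second single-cell data set is required as a reference.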
19:33Another family of approaches requires not
19:36only a known list of marker genes,
19:39but an annotated
19:45single-cell RNA-seq experiment.
19:49The strategy is what is represented here.
19:52You have a query data set that is
19:54your data set, where you
19:57have clusters but you don't have
19:59labels, and then you have a reference
20:01data set, so someone else already
20:03performed their analysis of single
20:06cells and labeled the clusters of cells.
20:09The strategy is to try to
20:12identify which of the clusters
20:15of the query data set are more similar
20:18to the reference, and this is
20:21basically a problem of classification:
20:23they try to classify a data set
20:25with unknown labels using a
20:28single-cell data set with known labels.
20:30Also in this family,
20:32obviously, you have many possible
20:36methods to do this.
20:38Some of the methods are based
20:41on correlation: they try to calculate
20:45the similarity through correlation measures.
20:47Other approaches use supervised
20:51classification methods that are
20:53commonly used in machine learning.
20:55This is one of the fields where,
20:58speaking about 2021, there are
21:03huge developments and a lot of
21:05tools that are being published
21:09in bioRxiv or in journals right now.
21:13And there is this metaphor that once we
21:15have a lot of
21:20data sets that are annotated,
21:23this task will become
21:25like
21:27mapping reads to a known genome.
21:31Performing a single-cell analysis
21:34with a new data set will become
21:36as simple as that because you have
21:38a lot of reference populations, and
21:40so it will be easier to annotate
21:43your cells once you have a collection
21:46of references that is reliable.
21:58These are two resources, two databases that
22:01collect cell type annotation markers,
22:03and so these can be used to compare the
22:08markers identified in your clusters with
22:10a known collection of markers that were
22:14identified based on single-cell data.
22:17If you
22:20look at flow cytometry,
22:22the best markers are considered to be the
22:26proteins that are expressed on the surface.
22:29The problem is that the transcripts of
22:32these surface markers are
22:35not always among the top expressed genes,
22:39and so they may be subject to dropout
22:42events. So the best collection of
22:45cell markers for a transcriptomic study,
22:48so based on gene expression, is different
22:50from the best collection of markers based on
22:54surface proteins.
22:59These two databases
23:01collected single-cell signatures,
23:03single-cell markers, in different
23:06samples and tissues, mainly
23:08for human and mouse.
23:11So comparing the
23:14two species is also important to know which
23:17markers are conserved across species
23:19and which ones are species specific.
23:26Then the last step for today
23:31is the trajectory analysis.
23:34While clustering tries to divide
23:38the cells into discrete clusters,
23:41the idea of trajectory analysis
23:43is that
23:47you are not capturing
23:49cells in discrete states,
23:51but you're capturing a sort of continuous
23:54process, for example
23:57differentiation.
24:01These kinds of methods try to place
24:05cells along a continuous path that
24:09represents the evolution of a process.
24:11This could be differentiation,
24:13but another
24:16simple example is the cell cycle.
24:19So instead of dividing
24:21cells into separate clusters,
24:22you try to construct a sort of trajectory
24:26that models this progression through,
24:30for example, differentiation.
24:32This sort of
24:35tools, trajectory inference
24:37analysis, is also sometimes named
24:40pseudotime analysis,
24:43because pseudotime is basically
24:47an abstract measure of
24:50the progression through the process,
24:52so from where the
24:55process starts to where it ends.
24:58The important assumption
25:00in order to perform trajectory
25:03analysis is that we are capturing
25:05with our single-cell experiment
25:08all the snapshots of the process
25:10that we want to model, and this
25:13means that we are also capturing
25:16the intermediates, because all
25:19the analysis hinges on the
25:21assumption that we have a continuum,
25:23not discrete states,
25:23and so we need to
25:26capture some cells that represent
25:29the transition between,
25:30for example, two differentiation
25:35states. So the assumption is
25:38that we're capturing all the
25:40snapshots and we don't have holes,
25:42and so we have a lot of intermediates.
25:44The warning is that
25:48these tools will capture a trajectory for
25:51any data set that you use as input,
25:54but this doesn't mean that the trajectory
25:56that you find has any biological meaning.
26:00To do this
26:02there are a lot of methods.
26:05A common and simple-to-explain
26:08approach is represented here.
26:12This is a PCA plot, and here
26:15instead of only two dimensions
26:17you see three dimensions,
26:19so PC1, PC2, and PC3.
26:22Each dot is a cell.
26:25What you do is you first perform
26:27a clustering of cells, with k-
26:30means or with a graph-based approach,
26:32this depends on the tool, and
26:35so you identify these clusters of cells
26:38inside your population. Now, if you remember,
26:42each cluster can be associated with
26:44a centroid, where the centroid is
26:47the central point of the cluster,
26:49so it is in the
26:51mean position with respect to
26:54all the elements of the cluster.
26:56These dots here represent the
26:59centroids of the clusters that you identified.
27:02Now what you do is you try to build a tree
27:05that connects these centroids, and
27:07to build this tree
27:09there are many strategies.
27:11One of the simplest is to
27:14build the minimum spanning tree.
27:17You try to connect all these
27:19points in a way that minimizes
27:22the total length of
27:25the branches.
27:25So if you have a set of points,
27:28you can find a solution with a tree
27:32that minimizes the sum of the lengths
27:35of all the branches of your tree,
27:37and this is called the minimum spanning tree.
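A minimum spanning tree over a handful of centroids can be found with Prim's algorithm, sketched below. The centroid coordinates are made up, Euclidean distance is assumed as the branch length, and the brute-force search is only suitable for the small number of clusters that this step deals with.

```python
import math


def minimum_spanning_tree(centroids):
    """Prim's algorithm: grow the tree from centroid 0, always adding the
    shortest branch that reaches a centroid not yet in the tree."""
    n = len(centroids)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        best = min(
            ((i, j) for i in in_tree for j in range(n) if j not in in_tree),
            key=lambda e: math.dist(centroids[e[0]], centroids[e[1]]),
        )
        edges.append(best)
        in_tree.add(best[1])
    return edges


# Made-up cluster centroids in a 2D reduced space.
toy_centroids = [(0, 0), (1, 0), (2, 0), (2, 1)]
tree_edges = minimum_spanning_tree(toy_centroids)
total_length = sum(
    math.dist(toy_centroids[i], toy_centroids[j]) for i, j in tree_edges
)
```

The result is a list of branches between centroid indices; a tree on n centroids always has n - 1 branches.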
27:40The assumption is that the minimum
27:42spanning tree is the correct tree that
27:46models the trajectory in this data,
27:49and this is not always the case,
27:51so a warning is that
27:53the minimum spanning tree is not
27:55always the best solution.
27:58Once you do this,
27:59you have your tree, and some tools
28:03try to assign a root to
28:06this tree, or they may
28:09also give you the possibility to select the
28:12root, so the user can say that this
28:15is the root of the tree
28:16because you know that these cells,
28:19for example, are most similar to
28:21what you expect from stem cells.
28:25Once you define the root of the tree,
28:28then you can
28:29refine and smooth your tree, and you
28:31can calculate the pseudotime for each cell,
28:34where the pseudotime of the cell will be
28:37the distance from the position of
28:39the cell to the root of your tree
28:42following the topology of the tree.
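The distance "following the topology of the tree" can be sketched as a walk from the root that accumulates branch lengths. For simplicity this illustration assigns a pseudotime to each centroid rather than to each cell; real tools also project the individual cells onto the smoothed tree, which is not shown. The centroids, the tree, and the choice of root are made up.

```python
import math


def pseudotime_from_root(centroids, tree_edges, root):
    """Distance of every centroid from the root, measured along the tree branches."""
    neighbours = {i: [] for i in range(len(centroids))}
    for i, j in tree_edges:
        neighbours[i].append(j)
        neighbours[j].append(i)
    time = {root: 0.0}
    stack = [root]
    while stack:
        node = stack.pop()
        for nxt in neighbours[node]:
            if nxt not in time:  # a tree has no cycles, so each node is reached once
                time[nxt] = time[node] + math.dist(centroids[node], centroids[nxt])
                stack.append(nxt)
    return time


# Made-up centroids and a tree connecting them (for example a minimum spanning tree).
example_centroids = [(0, 0), (1, 0), (2, 0), (2, 1)]
example_tree = [(0, 1), (1, 2), (2, 3)]
pt_from_0 = pseudotime_from_root(example_centroids, example_tree, root=0)
pt_from_3 = pseudotime_from_root(example_centroids, example_tree, root=3)
```

Changing the root reverses the ordering, which is why the choice of the root matters for the interpretation.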
28:50Is this more or less clear?
28:54Yes, to me. OK,
28:57so many methods use this approach:
29:00first dimensional reduction,
29:02then clustering, then the construction
29:05of a tree. At the beginning
29:09the tree was built on the single cells,
29:12but that's unstable, and so
29:14there was a switch from the single
29:16cells to the centroids,
29:17because the centroid is more stable:
29:20if you have
29:22deviations of the position,
29:23the overall tree will remain the same.
29:27That's why it's more stable. And
29:29then, depending on the choice of the
29:32dimensionality reduction approach
29:33and on the tree that you build,
29:36you have different outcomes and
29:38different trajectories. This is
29:41from a paper that
29:44two years ago was trying to compare
29:501, 2, 3, 4, 5, 6, 7, 8 different methods to perform both
29:53dimensional reduction and also
29:58trajectory inference.
30:01Each row here is a different data set,
30:04so you have five data sets, from
30:07the simplest, because
30:09it has the smallest size,
30:11so this is the number of cells,
30:14to an example where you
30:17have almost a quarter of a million cells.
30:20And with the same data set,
30:22you try to represent the data
30:24set with different approaches.
30:26This is principal component analysis,
30:28this is t-SNE,
30:29this is the UMAP approach, and these
30:32others are methods like Monocle that also try
30:34to perform trajectory inference.
30:37You see how differently the same data
30:40set is represented
30:43depending on the method that you use.
30:47What you see here is also
30:49that when the data set is bigger,
30:51some of the methods do not even finish,
30:54because the time required is too much.
30:56So you're not sure that some
30:59iterative methods that require a lot
31:02of steps converge, because you run out of,
31:05for example,
31:06memory or the available time before
31:08they complete the necessary number of steps.
31:11That's what you see here.
31:12But the point here is that, as usual,
31:15depending on the choice
31:17of the tool,
31:18you have a different representation
31:20of the same data,
31:24and there is no clear, automatic
31:27way to understand which is better.
31:32These are also some guidelines
31:34for selecting the best
31:37trajectory analysis tool, based
31:39also on the fact that some tools
31:43make assumptions on the trajectory.
31:46Some tools, for example, only try to
31:49model linear trajectories without branches.
31:53Some tools allow branches,
31:55but only bifurcations,
31:57so you can only have two choices
31:59when you have a decision.
32:01Some tools also allow
32:03multifurcations, so from one,
32:06let's say, crossroads you
32:09have multiple possible roads,
32:11and some of the tools also allow you to have
32:14cycles inside your trajectories.
32:17So depending on your assumptions,
32:20there are different families
32:23of methods that you can use
32:26and that are available online.
32:28This is all based, again, on a review
32:31of trajectory analysis tools.
32:33It's quite recent,
32:34from two years ago,
32:35so it contains
32:38for sure the most popular methods
32:41that are also used now.
32:44One exception that is not here,
32:48and it's
32:49a method that had a lot of
32:52popularity and
32:54offers some unique insight,
32:57is called RNA velocity, and this is
33:02the last thing I will speak about today.
33:05The paper was published three years ago,
33:09and it's a method that
33:13analyzes single cells
33:15using a biological insight
33:19that concerns splicing.
33:23You know that in human and in
33:26eukaryotic cells,
33:29when the RNA is transcribed,
33:30it has to be processed.
33:32One of the main steps of the processing
33:34is splicing, which removes
33:37the introns from your gene,
33:39and once the gene is spliced
33:40it is exported, and so on.
33:42The basic principle of RNA
33:44velocity is that when you perform
33:46single-cell RNA-seq,
33:48you have some reads that are
33:50captured from unspliced RNA and
33:53some reads that are captured
33:55from spliced
33:56RNA, and you can distinguish between
34:00unspliced reads and spliced reads
34:02when you align the reads
34:05to your reference genome, because
34:07spliced reads
34:10will not contain introns,
34:12basically, while unspliced reads
34:14will contain partially
34:16or totally intron sequences.
34:18So you can divide the reads into
34:20spliced and unspliced.
34:23That's the basic assumption.
34:25The second assumption is that you
34:29can calculate for each gene
34:31the ratio between unspliced
34:34and spliced reads.
34:35The assumption is that if you
34:39have a lot of unspliced reads,
34:41that means that the gene has
34:44high transcription at the moment,
34:47and so it means that in the future
34:50that gene will probably be more abundant
34:53in the spliced state,
34:55because if you are capturing a cell at
34:57time zero and you're capturing
35:00a lot of unspliced RNA, in the future
35:03that RNA will be spliced and so
35:05the spliced RNA will increase.
35:08On the opposite,
35:09if you have a lot of spliced
35:12RNA and no unspliced
35:15RNA, you can predict that transcription
35:18at the moment is shut off and
35:21the spliced RNA in the future
35:25will be reduced because of degradation.
35:28So based on the ratio between unspliced
35:30and spliced: if you have a high
35:32proportion of unspliced reads,
35:35you predict that in the future
35:37the gene will be more expressed.
35:39If you have no unspliced reads,
35:41you predict that the gene will
35:44be less expressed in the future.
35:47So if you're measuring
35:49the present state of the cell, using
35:53this concept you can predict where
35:55the cell will be in the future,
35:58and so you can infer the position
36:00of the cell in the future,
36:03and you can connect the observed
36:05cell with the predicted cell,
36:07and that's the principle of RNA velocity.
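The reasoning above can be written as a tiny calculation for one gene in one cell. The sketch assumes a common simplification in which the change of spliced RNA is the unspliced amount minus a degradation term proportional to the spliced amount; the parameter `gamma`, the time step, and the counts are hypothetical, and real tools estimate such parameters from the data across many cells.

```python
def velocity(unspliced, spliced, gamma):
    """Simplified rate of change of spliced RNA for one gene in one cell.

    Positive: more unspliced RNA than degradation removes, so expression is rising.
    Negative: little or no unspliced RNA, so the spliced RNA is decaying.
    `gamma` is an assumed degradation rate relative to the splicing rate.
    """
    return unspliced - gamma * spliced


def extrapolate(spliced, v, dt=1.0):
    """Predicted spliced amount after a short time step, never below zero."""
    return max(spliced + v * dt, 0.0)


# Made-up counts for two situations described in the text.
v_induced = velocity(unspliced=8, spliced=2, gamma=0.5)    # lots of unspliced RNA
v_repressed = velocity(unspliced=0, spliced=4, gamma=0.5)  # no unspliced RNA
future_induced = extrapolate(2, v_induced)
future_repressed = extrapolate(4, v_repressed)
```

Doing this for every gene gives a predicted future expression profile for the cell, and the arrow in an RNA velocity plot connects the observed cell to that prediction.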
36:09The plots that you see with RNA
36:12velocity are these, where each cell
36:15is
36:17linked to an arrow, and the
36:20arrow basically connects the cell
36:22with the prediction of where the cell
36:25will be in the future based on the
36:28ratios of unspliced versus spliced
36:31reads that you capture in this cell.
36:34This can
36:37also be seen as a trajectory
36:39prediction tool,
36:40because once you see the map of all
36:43the arrows you can visually see where
36:45the cells are going and
36:48you can sort of infer a trajectory.
36:50The advantage of this method is that,
36:52for example, you don't have to
36:54select the root: you know from the
36:56arrows where the root is.
36:58These arrows have a directionality,
37:01so differently from the previous approaches,
37:06you can not only have a trajectory but
37:08also have a trajectory with a direction.
37:14The problem that arises, obviously, is
37:17that all of this is based on the idea
37:20that you capture a lot of unspliced RNA,
37:24and this could be problematic,
37:25especially with certain libraries.
37:27For example, we have seen that 3'-end-enriched
37:30or 5'-end-enriched
37:33libraries do not cover the whole
37:35gene, and so you can have
37:37some biases. In the original paper
37:39they show that this method
37:41works also with the 10x data
37:44that are 3'-based,
37:46but at least for the data set that they show.
37:50And this is part of
37:52the original paper.
37:53And, as always, the cells
37:56are visualized in our
38:00dimensionality reduction space.
38:02For example, here you see again principal
38:05component 1 and principal component 2.
38:08Um, I have two questions. Yeah.
38:11Um, how would, for example, a splicing
38:15gene mutation affect this?
38:18Hey you guys, your lab does do that.
38:21Yeah, this is interesting.
38:23I don't recall that anybody, uh...
38:27This is interesting, obviously,
38:29because you expect a mutation in a splicing
38:34factor to deviate, like, the trajectory.
38:38I don't know if anybody did this.
38:41Uh, I have to say I didn't
38:44run, like, an exhaustive search,
38:47so it may be that in bioRxiv there is
38:49something about this, but not
38:51that I'm aware of.
38:54OK, so the other question is how many,
38:57roughly in proportion,
38:59how many genes can be seen or can
39:02be used to project these maps?
39:05'Cause I'm imagining only the highly
39:07expressed genes that normally
39:09contain introns can be used here,
39:11which shouldn't be that many right?
39:14Yeah, yes, so that's true.
39:15So here again in the paper they compare,
39:18uh, so this is SMART-seq2,
39:20that is the technology that is
39:22ideal because you have a lot of reads
39:24for each cell and you have the full
39:26coverage, and they capture 22% of
39:29reads that are unspliced.
39:33This is with the Chromium, so this is 10x.
39:37So yeah.
39:38So it seems that here the ratio is the same,
39:40but obviously they will be mostly
39:43the last intron, and so yeah,
39:46so the fact that you capture more,
39:49uh, a selection of genes that
39:52are highly expressed,
39:53that's the underlying
39:56bias of all the analysis
39:57at the single cell,
39:59and I assume that this is amplified
40:02in this sort of analysis.
40:05So, uh,
40:06but I cannot give you a number
40:09of genes because I never used it,
40:13and so I don't have
40:16hands-on experience with this.
40:18OK.
40:21But yes, for sure the limitation is on the
40:23number of genes and on the number of, uh...
40:27Yeah, and also on the length
40:30of the read and how much coverage you
40:34have from the poly-A tail, for example.
40:37I think that shorter genes,
40:39for example, genes with few exons
40:41and few introns that are shorter,
40:43will be also, like, more captured,
40:47more present in this analysis than
40:50long genes with long introns. Yeah.
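A small sketch of how one might check the limitation discussed here, namely how many genes have enough unspliced signal to contribute to the analysis. It assumes two cells-by-genes count matrices, one for spliced and one for unspliced reads. The function names, the toy counts, and the `min_counts` threshold are hypothetical and not taken from the paper or from any specific tool.

```python
import numpy as np

def unspliced_fraction(spliced, unspliced):
    """Fraction of unspliced counts per gene, summed over all cells."""
    s = np.asarray(spliced, dtype=float).sum(axis=0)
    u = np.asarray(unspliced, dtype=float).sum(axis=0)
    total = s + u
    # Genes with no counts at all get a fraction of 0 instead of NaN.
    return np.divide(u, total, out=np.zeros_like(total), where=total > 0)

def usable_genes(spliced, unspliced, min_counts=10):
    """Mask of genes with at least min_counts in BOTH layers (assumed rule)."""
    s = np.asarray(spliced).sum(axis=0)
    u = np.asarray(unspliced).sum(axis=0)
    return (s >= min_counts) & (u >= min_counts)

# Toy matrices: 2 cells x 3 genes (illustration only).
spliced = np.array([[20, 5, 0],
                    [30, 4, 0]])
unspliced = np.array([[10, 0, 0],
                      [15, 1, 0]])

fraction = unspliced_fraction(spliced, unspliced)
mask = usable_genes(spliced, unspliced, min_counts=10)
print(fraction)    # unspliced fraction per gene
print(mask.sum())  # number of genes passing the threshold
```

In this toy example only the first gene passes, which mirrors the point made in the discussion: lowly expressed genes, or genes with little intronic coverage, drop out of this kind of analysis.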
40:55Um, OK, my last slide is about,
40:58uh, this collection of resources
41:00about single-cell sequencing analysis,
41:03so the website is called
41:05scRNA-tools. And, uh, so this is,
41:09uh, like the trend of the number of
41:12tools that you can find in this,
41:14uh, collection.
41:15So right now there are over 1000
41:18computational tools for the analysis
41:21of single-cell RNA-seq data.
41:24And here you see the stats on the platform,
41:29so on the languages that are
41:31mainly used by these tools.
41:33So most of them right now are
41:35in R, but obviously
41:37almost every tool is either
41:39in R or Python.
41:41Um, then you have C++.
41:43Probably this covers some
41:45tools that
41:48need to be performed
41:50with efficiency from
41:52the computational point of view,
41:55and then you have some tools with
41:58MATLAB and others. And here what
42:01you see is these resources divided
42:05into categories.
42:07Some tools cover the full pipeline,
42:11at least from the digital gene
42:14expression, from once you have the
42:16gene expression, to all the steps
42:18of the analysis, so dimension reduction,
42:20clustering and so on,
42:22and some tools are more specific.
42:24So if you look at the frequency, you
42:26have most of the tools that are about
42:29visualization of single-cell data.
42:3140% are about visualization.
42:33Then in second position you have clustering
42:36and dimensionality reduction.
42:38I didn't speak about this,
42:40but also very important is the
42:43integration of different data sets.
42:45So this means the integration
42:47of different single-cell RNA-seq
42:49experiments,
42:49and also integration
42:51of multiple modalities.
42:53So for example, yeah, there
42:55are a lot of techniques now that
42:58enable you to capture, for example,
43:01the RNA levels and also the
43:03chromatin, uh,
43:06open versus closed state, uh,
43:09in the same cell,
43:11and so these are about how to
43:15integrate these multiple sources of
43:17information and multiple datasets.
43:20And then you have ordering, gene networks,
43:22differential expression,
43:22so a lot of topics
43:26that we covered in these last
43:30sessions. So if you go and look, uh,
43:34you find the tool,
43:35the platform,
43:39then the number of citations.
43:40For example, you can see
43:42which tools are more popular
43:43with respect to others,
43:45if you have to make a choice. And
43:48the advantage is that it is quite
43:51comprehensive and updated weekly.
44:01So.
44:05These eyes.