Analysis and Interpretation of Single-Cell Sequencing Data – Part 1: Introduction and Alignment

August 25, 2021

ID6875

00:00Now the main focus of today
00:02will be the analysis and interpretation
00:05of single-cell sequencing data.
00:07So we won't cover everything today,
00:09and so it will take at least
00:14another meeting to cover everything.
00:16But today we will cover the
00:18introduction on the methodologies,
00:20some technical and experimental issues,
00:22and also some issues
00:24with the analysis of these data.
00:27So single-cell analysis, as a
00:29definition, is the study of omics,
00:31at least that's what we're speaking about
00:34today: the study of omics, so
00:37genomics, transcriptomics, proteomics,
00:39at the single-cell level.
00:40So the advantage is that this
00:43family of methods allows you to capture
00:46the cellular diversity of tissues
00:49at single-cell resolution.
00:52So the field is bursting, it is
00:55exploding, with a number
00:59of novel experimental techniques every year.
01:02But there are also many
01:04computational challenges,
01:05so these methods,
01:06the single-cell methods, require the
01:09development of appropriate analyses.
01:10And so we will see that
01:14common workflows employ
01:16some generic methods,
01:17for example the clustering analysis that
01:20June spoke about in our first meeting.
01:23Some of the methods for normalization,
01:26for example,
01:27or for the calculation of differential gene
01:31expression, are taken from bulk RNA-seq,
01:34but that is not always the best choice.
01:38And since the field is rapidly moving,
01:41there is no gold standard, I would say,
01:45for any step of the analysis,
01:47so you will find a lot of methods,
01:50a lot of applications.
01:52You can find papers in the literature that compare,
01:54for each step of the analysis,
01:56alternative approaches,
01:57but there is no gold reference
02:00that you can choose.
02:02For example,
02:03there are sort of standard pipelines in
02:06bulk RNA-seq, and in single cell
02:09it's not so established.
02:13This is a comparison of the
02:15methods, single cell versus bulk.
02:16So in the bulk analysis you take a tissue
02:19or a population of cells and you extract
02:21RNA from the whole population, so that
02:24you mix up the RNA content in the same,
02:29yeah, in the same container, let's say,
02:31and then you prepare the library
02:34and you sequence RNA from
02:37the whole population of cells.
02:40This means that you get, for each library, from
02:43each collection of cells, from each tissue,
02:46only one measurement.
02:47And this measurement of genes represents
02:50the average expression of these genes
02:53across all the cells of your tissue.
02:56So obviously you cannot use bulk
02:58RNA-seq if you want, for example,
03:01to see the cellular heterogeneity in your
03:04tissue. With a single-cell analysis you
03:07first perform a step that
03:10is the isolation of the cells.
03:12So this is kind of tricky,
03:14especially in solid tissues, because you
03:17need to mechanically separate each cell.
03:19It's easier with liquid tissues.
03:22It's easiest, for example,
03:23when you consider the analysis of
03:26hematopoietic cells. And so inside each
03:28single cell you perform the
03:30quantification of gene expression,
03:32because you have a way to create a
03:35library where you can keep track
03:38of the cell of origin of each RNA,
03:41and so that's why
03:42you can then quantify, for each gene,
03:45the expression in each single cell,
03:48so that each cell has a distinct
03:51expression profile.
03:51For example,
03:52this cell expresses only one gene.
03:54These other cells express
03:57multiple different genes, and in
03:59different amounts, and so you
04:01can use
04:02this difference in expression
04:05between different cells in order
04:08to see how similar cells are
04:10to each other, or how different.
04:12So for example,
04:13you can perform clustering analysis of
04:16cells based on their expression profiles,
04:18and also other downstream analyses.
04:21So obviously you have richer data
04:25that you can see,
04:28and you have many more options
04:32in the final analysis.
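As a toy illustration of the bulk versus single-cell comparison above, the sketch below uses a small made-up table of counts (three hypothetical cells and three hypothetical genes; none of these numbers come from the lecture). It only shows that pooling cells leaves one averaged value per gene, while keeping cells separate lets you measure how similar two cells are. The Euclidean distance is an arbitrary choice made for the example.

```python
import math

# Hypothetical counts: one row per cell, one column per gene (A, B, C).
cells = {
    "cell_1": [10, 0, 0],  # expresses only gene A
    "cell_2": [0, 5, 3],   # expresses genes B and C
    "cell_3": [0, 7, 1],   # expresses B and C in different amounts
}

def bulk_profile(cells):
    """Average each gene over all cells, as a pooled (bulk) library would."""
    n = len(cells)
    return [sum(column) / n for column in zip(*cells.values())]

def distance(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

bulk = bulk_profile(cells)  # one value per gene, cell identity is lost
d_12 = distance(cells["cell_1"], cells["cell_2"])
d_23 = distance(cells["cell_2"], cells["cell_3"])
print(bulk)
print(d_12 > d_23)  # cells 2 and 3 are closer to each other than cells 1 and 2
```

Distances like these are the kind of input a clustering analysis works from.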
04:36So yes,
04:36when it was launched there
04:39was this kind of comparison with
04:41bulk RNA-seq: that bulk
04:43analysis is like the analysis of a fruit
04:46smoothie, and single cell
04:48is like the analysis of a fruit
04:51salad, where you can distinguish the
04:53contribution of each fruit, where each
04:56fruit is a different cell type or subtype.
04:59Now the main applications of single-cell
05:01RNA sequencing, when we're speaking
05:04about discrimination among different cells.
05:07There are multiple. I divided
05:09these into two branches.
05:11So one is the so-called
05:14discrete analysis.
05:15So you
05:17collect the expression,
05:18the abundance of transcripts of genes
05:21inside each cell, and you want to
05:23cluster cells in order to identify
05:25different cell types,
05:26for example
05:27the cell types that compose the
05:29tissue that you're studying.
05:31So this is a discrete analysis because
05:34you are assuming that your tissue is
05:36composed of different types of cells
05:39that are clearly
05:40distinguishable from each other,
05:42and so this analysis has,
05:44for example, something to do
05:46with clustering,
05:49because ultimately you want to
05:51identify separate clusters of
05:53cells based on their expression
05:55profile. Question, a quick question.
05:57A quick question: this is super
05:59relevant to hematopoiesis,
06:01so there's even controversy.
06:03I don't know that it should
06:05be a controversy about whether
06:08there are discrete cell states
06:10versus everything being continuous,
06:12and logic tells me that there's
06:15going to be a continuous change
06:18in a bazillion different genes,
06:20because every cell is going
06:22to be slightly different.
06:24So do you have to actually ask
06:27the algorithm to analyze the
06:29data to find discrete sets versus
06:32find a continuous analysis?
06:36So I personally don't know if there is a way,
06:41if there is a tool, I have never used a
06:44tool that tells you if the best analysis
06:47is discrete or continuous. OK, I think
06:50that probably if we looked at different
06:52papers where they claim it's discrete
06:54versus claiming it's continuous
06:56we would find differences in
06:58how they analyzed it. Yes,
07:00so the prior knowledge of the
07:02sample is something that you can use,
07:05and for example if you take
07:07peripheral blood,
07:09if you take single-cell data sets
07:11of peripheral blood, where
07:13most of the cells are mature
07:15and already differentiated,
07:16then you see clearly that you have
07:19very separated, discrete clusters,
07:21and so it makes more sense to perform
07:25a discrete analysis, or clustering
07:27analysis. If you take bone marrow,
07:30or a population that is enriched
07:32for stem cells or progenitors,
07:34then you expect to have a more continuous
07:39representation of your sample.
07:41And so this is important because
07:43whatever tool you use,
07:46any clustering analysis will
07:47give you clusters, and any
07:50continuous analysis, such as
07:52inference of trajectory,
07:53will find a trajectory.
07:55So if you submit your sample to any analysis,
07:59you will obtain a result,
08:01but the result can be meaningless.
08:04For example, a continuous analysis
08:06can be meaningless if
08:09your sample is biologically not,
08:11um,
08:12for example,
08:13something that is differentiating
08:15or developing.
08:19So yeah, yeah, so yeah.
08:22And that's that's the parallel.
08:24So whenever you see
08:29courses or tutorials on continuous analysis,
08:31that's something you need to be careful about.
08:33You will always get a graph.
08:35You will always get a sort
08:38of differentiation tree,
08:39but you have to be careful because
08:42sometimes it doesn't make sense
08:46to do the analysis at all.
08:49Because
08:50one of the assumptions of a
08:52continuous analysis is that you
08:54have a sampling of the continuous
08:56process that you're trying to model,
08:58for example development or differentiation.
09:00If this is not true,
09:01you don't have the grounds
09:03to do the analysis at all.
09:07So Tommaso, I think this
09:10is a very important question that
09:12I raised, because you know in real
09:15life situations we will get samples
09:17sequenced, and how do we tell if
09:20this is reasonable or not reasonable?
09:22So I just wonder if anyone has done
09:25a very careful analysis to sort of,
09:27you know, establish some ground truth. For example,
09:30you have two discrete cell states that you
09:32already isolated or somehow maintained, and you
09:34put them into single-cell sequencing,
09:37and then you force it to assume a
09:40trajectory-based methodology: does it
09:41cause many major artifacts or not?
09:43I think that's one of the ways
09:46to think about it.
09:48Yeah, so I'm not aware, meaning
09:51that I have never used tools that
09:55explicitly tell you which
09:58branch of the analysis is better.
10:01So by exploratory analysis,
10:02for example, we will see that when
10:06you do the preprocessing and then the
10:09dimensionality reduction, and you have
10:12a plot, like on a hyperplane, of cells,
10:16you can guess,
10:18depending on the structure of your sample,
10:21whether it's more reasonable to
10:23proceed with discrete clustering or
10:25to perform a trajectory analysis, or both.
10:27So sometimes, for example,
10:30it could make sense to start with an
10:33exploratory analysis on all the cells.
10:36I take this as an example because
10:38it's on this slide,
10:40so this seems to be a separate
10:43cluster of cells.
10:45It could be reasonable,
10:46then, to select only this
10:48cluster. Within this cluster
10:50we don't see clear subclusters,
10:51so within this cluster
10:54it could make sense to perform
10:56a trajectory analysis to see if
10:58there is a continuous process,
11:00but not at the beginning,
11:02taking into consideration
11:03also these two clusters here,
11:05because they're clearly separated.
11:07So sometimes I think that
11:09the workflow can also be mixed.
11:12So you start with all the cells,
11:14you remove clear outlier clusters,
11:16maybe you annotate the clusters
11:18so that you know,
11:20for example,
11:20that inside your population you have a
11:23mixture of progenitors or stem cells,
11:25and inside that cluster
11:28you perform the trajectory analysis.
11:33I see. So that would be
11:35my tentative answer.
11:36Now I don't know if anyone
11:38else has other suggestions.
11:46Maybe we can leave that.
11:48I can find some material
11:49for next time to see if I can
11:52answer more extensively.
11:55I think it's a tough question
11:56because I don't think there's
11:57a consensus necessarily in the
11:59field, and people just see,
12:00OK, if it makes sense or it doesn't
12:02make sense to their own eyes.
12:05Yep. Yeah, again I don't know
12:07if someone is trying to build
12:09some tools that, you know,
12:12kind of quantify the
12:16reasonableness of each
12:18of the approaches. Yeah.
12:23But yes, it's an important distinction.
12:26Also, yes, also for later on.
12:30Um, some history.
12:32So this is the first publication on
12:34single-cell sequencing, so it's 20,
12:37no, sorry, 12 years ago.
12:39So it was an RNA-seq covering
12:42the whole transcriptome of a single
12:45cell. It was really one single cell,
12:48because it was a mouse blastomere that
12:51was isolated with a microscope, so it
12:54was manually picked under
12:56a microscope, then lysed and then
12:58sequenced, and together with the blastomere
13:01oocytes were also analyzed.
13:03And so basically the trick here
13:06to reach single-cell resolution
13:09was the isolation of these cells,
13:11and then the procedure was
13:14standard lysis and then library
13:16preparation as in bulk RNA-seq,
13:19but starting from one cell.
13:23So from that the field,
13:26as I told you,
13:27exploded, and so in this plot here,
13:31this is from a review from 2018,
13:34so it was ten years after this
13:37first publication,
13:38what you can see is the
13:40release of multiple approaches
13:42for single-cell RNA-seq
13:44that increase the number
13:46of cells that you can study.
13:48So obviously that first one
13:50was a proof of concept,
13:52but the real
13:54single-cell explosion happened
13:55when it became
13:58possible to parallelize the process,
13:59so when you were able to capture
14:01the single-cell expression level of
14:03first hundreds, then thousands,
14:05and then millions of cells.
14:07So here you see the publication date of
14:10the techniques and
14:12the number of single cells
14:14that were analyzed.
14:15So this is the first, with only one
14:17cell, and then you see that the trend
14:20is to release techniques that allow
14:22you to increase the throughput
14:24in terms of the number of
14:27cells that you can quantify,
14:29that you can consider in each experiment.
14:33Question, yeah, I don't know if
14:35you're going to get to this,
14:36so if you are, just say never mind.
14:40As the number of cells that are
14:42being sequenced has increased,
14:44the number of reads per cell that
14:46people get and report on has decreased,
14:49and I'd like to understand: is
14:51that just because that's what's
14:53convenient in terms of putting
14:54it onto an Illumina sequencer,
14:56or is there something about the various
14:59techniques where you reach the limit
15:01of your detection after X number of
15:04reads and it's not worth getting more?
15:08Yes, so there is a tradeoff
15:10between these two parameters.
15:11One is the number of cells that you consider
15:14and the other is the number of reads
15:17that you obtain for each cell.
15:20The trend for the techniques has been
15:23mainly to increase the number of cells,
15:27and obviously this
15:29works against the number
15:31of reads for each cell.
15:34So for example,
15:38let's say that the field and
15:40the most popular techniques,
15:42for example 10X, have been
15:44increasing the number of cells more
15:46than the number of reads for each cell.
15:50This depends on the
15:52application of the method, I guess.
15:54So obviously if you're interested in
15:57the cell as your unit of interest,
15:59so if you're interested in
16:01more cellular biology,
16:03you're interested more in capturing
16:05cells and separating cells,
16:06you're not so interested in looking
16:08in great detail at the thousands
16:10of genes that are expressed within
16:13each cell. On the other side,
16:15if you're more interested,
16:16for example, in the molecular
16:18biology rather than
16:20just separating cells, it would
16:22be more interesting to increase the
16:25depth of the sequencing in each cell.
16:29There are techniques where this is
16:32maximized, and obviously the tradeoff
16:35is that you cannot get as many cells
16:38as with the other methods.
16:42Um.
16:45I think I'm
16:47just asking also, and maybe June
16:49knows: is there a maximum number
16:50of reads you want per cell?
16:53Because after that you don't
16:54get any additional information.
16:57So we will see that we can measure when
17:00you reach the plateau,
17:03when you reach the plateau of the
17:07sequencing, using tricks such as the UMI.
17:10So if you append to each
17:12read a random barcode,
17:14you can see when it doesn't make sense
17:17to sequence to more depth, because all the
17:20additional reads that you are detecting
17:22are PCR duplicates of what
17:25you already sequenced. Right,
17:27got it. Yeah, I
17:28agree with Tommaso.
17:29So I think there are some studies that have
17:32done those kinds of things before,
17:33but they were using different
17:35technologies compared to what you're
17:37going to use probably right now,
17:38so I'm not sure whether for every
17:40single technology out there
17:42there has already been a paper published.
17:44Maybe for 10X there's already a paper
17:45published on the standard procedures,
17:47but in your own data you can actually
17:49analyze it yourself to see whether
17:51it's approaching saturation or not,
17:53and you can sequence more from the
17:55same library if you want to.
17:57Yeah. Yes, yes, using UMIs
18:00you can measure that.
18:02OK, I understand, thank you.
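The saturation check mentioned here can be sketched in a few lines. This is a minimal illustration, not the method of any specific pipeline: it assumes each read has already been reduced to a hypothetical (cell barcode, UMI, gene) tuple, and it uses one common definition of saturation, the fraction of reads that are duplicates of a molecule already seen. All sequences below are made up.

```python
def saturation(reads):
    """Fraction of reads that repeat an already observed (cell, UMI, gene)."""
    if not reads:
        return 0.0
    return 1.0 - len(set(reads)) / len(reads)

def saturation_curve(reads, step):
    """Saturation computed on growing prefixes of the read list.

    Assumes the reads are in random order, so that a prefix mimics
    shallower sequencing of the same library.
    """
    return [(n, saturation(reads[:n])) for n in range(step, len(reads) + 1, step)]

# Hypothetical toy library: 4 distinct molecules, sequenced 12 times.
reads = [
    ("AAAC", "GTCA", "geneA"), ("AAAC", "TTGA", "geneA"),
    ("AAAC", "GTCA", "geneA"), ("CCGT", "ACGT", "geneB"),
    ("AAAC", "TTGA", "geneA"), ("CCGT", "GGAT", "geneB"),
    ("AAAC", "GTCA", "geneA"), ("CCGT", "ACGT", "geneB"),
    ("AAAC", "TTGA", "geneA"), ("CCGT", "GGAT", "geneB"),
    ("AAAC", "GTCA", "geneA"), ("CCGT", "ACGT", "geneB"),
]
curve = saturation_curve(reads, step=4)
for n_reads, value in curve:
    print(n_reads, round(value, 2))
```

When the values stop rising and sit close to 1, additional reads are mostly PCR duplicates of molecules that were already detected.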
18:05The same techniques,
18:06for example 10X, at every
18:09release
18:13increase the detection of multiple
18:15molecules inside each cell, so that
18:17the saturation limit is higher, and so
18:19it really depends also
18:22on the technique and on the
18:24version of the technique itself.
18:28But in general, I would say that
18:30the number of cells that you can
18:33measure has increased more,
18:35on average across the techniques,
18:37than the depth,
18:40than the within-cell depth.
18:42With an exception that I
18:44show you in this slide,
18:47that is the Smart-seq
18:49family of techniques.
18:50So this family of techniques is
18:53the ideal family when you are not
18:56interested in capturing a lot of cells,
18:59but you want to maximize the analysis
19:02within each cell, and the advantage
19:05of these techniques is that
19:08you can have 1,000,000 reads for each cell.
19:11So it has a high coverage, and also
19:14it's one of the techniques that allow you
19:18to capture reads from the whole transcript.
19:21So we will see that the majority
19:23of commercial techniques,
19:25such as 10X,
19:28do not allow you to cover
19:30the full transcript,
19:31but they are three prime
19:32end or five prime
19:34end libraries.
19:35So that means that you can capture
19:37only the fragment that is near the
19:39poly-A for the three prime end, or near
19:41the cap for the five prime end.
19:43This is one of the few methods where
19:46you can, as in bulk
19:48RNA-seq, capture reads from
19:50the full transcript, and this is an
19:52advantage because, for example, if
19:54you want to do splicing analysis,
19:56that's the only way you can,
19:58that's the
20:01only method you can use
20:04to perform a full
20:06splicing analysis. Otherwise
20:07you can perform splicing analysis
20:09only on the initial exons at the
20:10five prime end, or on the terminal
20:13exons. And also for allele analysis,
20:15so analysis of variation, analysis of
20:17SNPs from RNA-seq,
20:20if your mutation of interest,
20:22if your variation of interest, is
20:25inside the body of the gene and not
20:28at the five prime or the three prime end.
20:31So this has all the advantages of
20:34allowing an analysis within each cell
20:37that is comparable to bulk
20:39RNA-seq.
20:42It has a limitation that is shared
20:44with other techniques,
20:46which is that most of the single-cell
20:48techniques right now allow you to
20:51detect only polyadenylated RNA,
20:53because they're based on poly-A selection.
20:56And you have a low number of cells that
20:58you can sequence in each experiment,
21:01so fewer than 1000.
21:02Then Smart-seq already has three versions,
21:05so it was released first in 2012.
21:07Then here is Smart-seq2, released
21:09one year after, and then the
21:12latest is Smart-seq3, which was
21:14released last year.
21:16So each of these increased
21:19the number of usable
21:22reads. Smart-seq2
21:23didn't allow you to use
21:26UMIs, but Smart-seq3 also allows you to use
21:30UMIs. And this is a comparison
21:32between the two versions,
21:34Smart-seq2 and Smart-seq3, where you
21:36see the box plot with the number of
21:40genes that are detected within each
21:42cell, and as you see, with Smart-seq3
21:45you can detect, for each cell,
21:47from 10,000 to 12,000 genes.
21:50And also this number is comparable
21:52to bulk RNA-seq. If you compare
21:55this number with the other methods,
21:57such as 10X, in 10X
21:59I think the average is 3,000 to 5,000 genes for
22:03each cell, when this value is high.
22:11OK, so this is an example of high coverage
22:14but low throughput.
22:15On the other hand, you have methods
22:18where you have low coverage inside
22:20each cell but high throughput, and
22:23one family of these methods is
22:25the so-called droplet-based methods.
22:27This was one of the first
22:31that was released, and it is
22:35Drop-seq.
22:37So the principle is to isolate
22:42single cells in single droplets,
22:44where you have your cell and you have a
22:47barcoded bead. The barcoded beads
22:49allow you to attach a cellular barcode
22:53that is unique for each bead
22:56and so is unique for each cell.
22:59And so that's the trick that is
23:02used in order to
23:04associate the content of each
23:06cell with a single barcode,
23:08that is, the cell barcode.
23:12So it allows you to map thousands or tens of
23:16thousands of cells in the same experiment.
23:19Drop-seq is
23:21only three prime end sequencing,
23:24and it allows the use of
23:26unique molecular identifiers, UMIs.
23:28So I will have some slides later to
23:31show what that means.
23:34This is the pipeline,
23:36the experimental pipeline
23:37of a Drop-seq experiment.
23:40So the principle is to isolate,
23:43at some point, in a droplet,
23:45one cell with one microparticle.
23:47Inside this droplet you have the capture of
23:50the polyadenylated RNA with a poly-T probe,
23:53and then you have
23:56the reverse transcription and the generation
23:58of the cDNA, and the library preparation.
24:01This is kind of similar
24:04to bulk RNA-seq approaches.
24:06Very similar to Drop-seq
24:08is also the 10X approach,
24:10that is the commercial
24:12development of Drop-seq.
24:14So in 10X
24:16you have the same strategy of
24:18dividing cells so
24:20that you have droplets in oil,
24:22in this case with a single cell and a
24:25single bead with the cellular barcode.
24:28And as you can see, the oligos attached
24:31to each bead have a standard adapter that
24:33you can use in Illumina sequencing,
24:36you have the cellular barcode,
24:38you have the UMI,
24:40and then you have a poly-T probe
24:42that is used to capture poly-A RNA.
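To make the oligo layout concrete, here is a small sketch that splits the barcode-containing read into its parts. The lengths used (a 16-nucleotide cell barcode followed by a 12-nucleotide UMI), the function name, and the example sequence are assumptions for illustration only; the actual lengths depend on the technique and on its version.

```python
def split_barcode_read(seq, barcode_len=16, umi_len=12):
    """Split a read into (cell barcode, UMI, remainder).

    barcode_len and umi_len are assumed defaults; adjust them to the
    chemistry that was actually used.
    """
    if len(seq) < barcode_len + umi_len:
        raise ValueError("read is shorter than barcode + UMI")
    barcode = seq[:barcode_len]
    umi = seq[barcode_len:barcode_len + umi_len]
    rest = seq[barcode_len + umi_len:]  # typically the start of the poly-T stretch
    return barcode, umi, rest

# Hypothetical read: 16 nt barcode + 12 nt UMI + poly-T.
read = "ACGTACGTACGTACGT" + "TTGACCATGGCA" + "TTTTTTTT"
barcode, umi, rest = split_barcode_read(read)
print(barcode, umi, rest)
```

Only the barcode and the UMI are kept from this read; the cDNA that gets aligned comes from the other part of the fragment.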
24:47Um, this is to remind you
24:49that different platforms,
24:50according to different strategies, have
24:51different gene coverage. So Smart-seq2,
24:53which we saw before,
24:55has full coverage.
24:57So if you consider this as a metagene,
25:00we have the five prime UTR,
25:02the body of the gene,
25:03the coding sequence, and the three prime UTR:
25:06you have coverage of the full transcript,
25:08while with 10X or a three prime end
25:10method you have enrichment only
25:12at the three prime end of the
25:14transcript. With the five prime method
25:16you have enrichment only
25:17at the five prime end,
25:19so you need to be careful about
25:21which library you are using,
25:22because if it is just
25:25for gene quantification
25:27the methods can be comparable,
25:28but if you are interested,
25:30for example, in isoform
25:32expression, splicing
25:33analysis and so on, only
25:35these methods allow you to
25:37perform a complete analysis,
25:38not these ones.
25:42And this is another plot
25:43comparing the gene coverage
25:45when you have full coverage.
25:46So this plot here is similar
25:48to plots that you could obtain
25:50from bulk RNA-seq.
25:52This method is three prime end,
25:54this is a three prime end
25:56method, and so you see the
25:58enrichment at the three
25:59prime end of the transcript.
26:04Now this was mainly for the technical part.
26:08Now the outlook on the computational
26:11analysis of single-cell data is
26:13summarized by this workflow.
26:15So most of the most popular methods
26:18allow you to generate libraries,
26:20and the result will be reads,
26:23sequenced with a standard platform
26:25such as Illumina. Some single-cell
26:28methods have also been published
26:30that use full-length sequencing.
26:32I think that probably
26:35in the future they will be
26:39used more,
26:40but right now the standard is
26:42to use short-read sequencing
26:44coupled to single-cell analysis. So
26:46we will see how raw data are obtained
26:49and how the raw reads can be
26:52transformed into count matrices
26:54that are similar to the count
26:56matrices of bulk RNA-seq,
26:59but instead of having samples
27:02and genes you have single cells and
27:04genes in your matrix, and the numbers
27:07correspond to the number of reads
27:09mapping to the gene in the cell.
27:12Then there are quality control methodologies,
27:14visualization methodologies, clustering,
27:17identification of trajectories,
27:19so analyses that assume your
27:23sample, your population of cells,
27:26is continuous, and methods that assume
27:28that your population is discrete.
27:31So let's start from the beginning.
27:33So usually in most of the methods,
27:35in the raw reads that you receive,
27:38there are three important parts
27:39that you have,
27:40three parts of the sequence.
27:43And so the first important part
27:45is the cell barcode.
27:47So this is an oligonucleotide that
27:50can be 8 to 12 or more nucleotides long.
27:53This depends on the technique,
27:55and this sequence, the
27:58cell barcode, is unique for each of the beads,
28:01for example, that you used
28:04when your cell was
28:05in the droplet,
28:07so it's what you use to identify the cell,
28:10meaning that one of the first steps
28:12is to look at the region of the
28:14read that corresponds to the cell
28:16barcode and to group together
28:18all the reads that have the same
28:20barcode, and that's what you see here.
28:22So all the reads with the same barcode,
28:25here in red, belong to cell one,
28:27because this is the cell barcode
28:29of this cell, and so on,
28:31so that all reads are grouped
28:33according to the value
28:34of the barcode.
28:36So obviously here there are some
28:39methodologies to account for possible errors
28:41in the sequencing of the barcode, so that
28:45barcodes are designed in a
28:47way that they differ at
28:49multiple nucleotides,
28:50so that if you make only one error
28:54you don't
28:54switch from one cell to another,
28:58but you need at least,
29:00for example,
29:01three errors in the sequencing
29:03of the barcode to identify
29:05the wrong cell for the read.
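The grouping and error-tolerance idea described above can be sketched as follows. This is an illustrative toy, not the procedure of a specific pipeline: the barcode list, the read names, the function names, and the choice to tolerate exactly one mismatch are all assumptions. The example barcodes were picked so that any two of them differ at three or more positions, which is what makes a single sequencing error unambiguous.

```python
from collections import defaultdict

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def assign_barcode(observed, known_barcodes, max_errors=1):
    """Return the known barcode within max_errors of observed, or None.

    None is returned when no barcode, or more than one, is close enough.
    """
    hits = [bc for bc in known_barcodes
            if len(bc) == len(observed) and hamming(bc, observed) <= max_errors]
    return hits[0] if len(hits) == 1 else None

def group_reads(reads, known_barcodes):
    """Group reads by corrected cell barcode; drop unassignable reads."""
    groups = defaultdict(list)
    for barcode, read_name in reads:
        cell = assign_barcode(barcode, known_barcodes)
        if cell is not None:
            groups[cell].append(read_name)
    return dict(groups)

# Hypothetical barcodes that differ from each other at 3 or more positions.
known = ["AAAAAAAA", "AAACCCGG", "TTTAAAGG"]
reads = [
    ("AAAAAAAA", "read_1"),  # exact match
    ("AAAAAAAT", "read_2"),  # one sequencing error, still assigned to AAAAAAAA
    ("AAACCCGG", "read_3"),
    ("GGGGGGGG", "read_4"),  # far from every known barcode, discarded
]
groups = group_reads(reads, known)
print(groups)
```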
29:10The second part that is
29:12important is the so-called
29:14UMI. So while
29:17the cell barcode is unique for each cell,
29:20the UMI is unique for each of the
29:23original molecules in your sample,
29:25and that's because, in the
29:27library preparation strategies,
29:28this is a random oligonucleotide
29:31that is appended to the
29:35library during the cDNA
29:39generation, before
29:40the amplification steps,
29:42so before the PCR.
29:44So this means that
29:47this stretch,
29:50this random barcode, can be used to
29:53discriminate between PCR duplicates
29:54and real biological duplicates.
29:57So in RNA-seq you can expect to see
30:00two reads that are the same because
30:03they were derived from two copies,
30:06two different transcripts
30:07transcribed from the same gene.
30:09So some genes,
30:11such as the transcripts
30:12of ribosomal proteins,
30:14are expected to be in the
30:17range of 1,000 to
30:1910,000 copies in a single cell,
30:21so you can expect to have more
30:24molecules captured in your library,
30:26but these are true biological duplicates,
30:29because at the origin you have two
30:32distinct RNA molecules.
30:34This is different from PCR
30:37duplicates, because these are created
30:40during the amplification step.
30:42So that's why the
30:44UMIs
30:44are important, because this discrimination
30:46between biological duplicates and
30:48technical duplicates is really important
30:50when you perform many
30:51rounds of amplification,
30:52and this happens when you
30:54have low input material,
30:56such as in some tricky libraries
30:58where you have a low amount
31:01of sample, and this is the case in
31:04single-cell approaches because, again,
31:06you're starting from the amount of RNA
31:09that is extracted from one single cell.
31:11So you can imagine that you have
31:14a lot of amplification
31:16occurring in order to detect the gene.
31:19So this is the definition: the UMI
31:22is a randomized nucleotide sequence.
31:24Again depending on the library preparation
31:27and on the technique that you use,
31:29it can be 8 nucleotides long or
31:3112 nucleotides long; the longer
31:34the better.
31:35It's incorporated into the cDNA in the
31:37initial steps of the RNA-seq protocol,
31:40so before the amplification step.
31:42So the goal of the UMI is to
31:44distinguish between amplified copies
31:46of the same RNA molecule,
31:48because these have the same cDNA sequence
31:51and they have the same
31:53UMI,
31:53so they are technical duplicates
31:55and they are removed,
31:57while
31:57what you want to keep is reads
31:59from separate RNA molecules
32:00transcribed from the same gene,
32:02because these will have the same
32:04cDNA but will have a different
32:06UMI,
32:06so these are biological duplicates
32:08and they are kept. So
32:10UMIs are a method to reduce
32:14the amplification noise.
32:15This is a graphical example
32:17of the importance,
32:18so this is an example where
32:19you have a reference sequence,
32:21so this is a region of a gene,
32:24for example, and in your experiment,
32:26for example in the same cell,
32:27you get 10 reads
32:28with identical sequence,
32:30so that they align to the same region.
32:33So if we assume that they are all
32:35PCR duplicates, we have to remove
32:37all but one of them and keep only one.
32:39And that means that when we calculate
32:42the abundance of the gene, if we don't
32:44remove the duplicates, we will say this
32:47gene has a count of 10.
32:50After deduplication we would say that
32:52the gene has a count of one, because
32:55by using this approach we assume that
32:58all the duplicates are PCR duplicates.
33:01If we include the UMIs,
33:03we can separate technical
33:05from biological duplicates.
33:06So we can use the UMIs (here different
33:10UMIs are different colors) in
33:12order to group technical duplicates,
33:15for example
33:15these four, these three, and these two,
33:19but we keep the biological duplicates.
33:21So in the end,
33:23instead of collapsing everything,
33:24we can keep four reads, because,
33:27having four different UMIs,
33:29they probably correspond to four different
33:31original molecules in our sample.
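The slide example can be reproduced as a short sketch: ten identical reads from one cell, carrying four different UMIs in groups of four, three, two, and one. The UMI sequences are made up, and errors in the UMI sequence itself, which real tools have to handle, are ignored here.

```python
from collections import Counter

# Ten reads from one cell with identical sequence, aligning to the same region.
# Each read is represented only by its UMI (hypothetical sequences).
umis = ["AACGT"] * 4 + ["GGTCA"] * 3 + ["TTAGC"] * 2 + ["CCATG"] * 1

count_no_dedup = len(umis)               # every read is counted
count_position_dedup = 1 if umis else 0  # all identical reads collapsed into one
count_umi_dedup = len(set(umis))         # one count per distinct UMI

reads_per_molecule = Counter(umis)       # how often each original molecule was amplified

print(count_no_dedup, count_position_dedup, count_umi_dedup)
```

The three numbers correspond to the three ways of counting discussed above: no deduplication, collapsing everything as PCR duplicates, and collapsing only reads that share a UMI.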
33:39Is everything clear? Do you hear me?
33:43I don't have interaction. Yes, yes.
33:47Yes, OK.
33:50OK, so that's why you use the cell
33:53barcode to identify the cell and you use the
33:55UMI to remove technical duplicates,
33:57and that's why instead of a count matrix,
34:00so instead of the number
34:02of reads, you can find in single-cell
34:04experiments the number of UMIs,
34:06because basically what you are
34:08doing is collapsing reads
34:10mapping to
34:11the same gene and with the same
34:14UMI. And after you do all these
34:18steps
34:20you can arrive at your count
34:22matrix. In single cell
34:23it's called digital expression matrix,
34:25because it represents the number of reads
34:29mapping to one gene in each of your cells.
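As a rough sketch of how a digital expression matrix could be assembled, the function below takes one record per aligned read and counts distinct UMIs per gene and per cell. The function name, the record layout and the barcodes are assumptions made for illustration, not the output format of any specific pipeline.

```python
from collections import defaultdict

def digital_expression_matrix(records):
    """Build gene -> cell -> UMI count from (cell_barcode, umi, gene) records.

    Reads with the same cell barcode, gene and UMI are treated as
    technical duplicates and counted once.
    """
    # A set keeps one entry per (cell, gene, UMI) combination.
    molecules = {(cell, gene, umi) for cell, umi, gene in records}
    counts = defaultdict(lambda: defaultdict(int))
    for cell, gene, _umi in molecules:
        counts[gene][cell] += 1
    return {gene: dict(per_cell) for gene, per_cell in counts.items()}

# Hypothetical reads: (cell barcode, UMI, gene the cDNA read mapped to).
records = [
    ("CELL_A", "AAGT", "GeneX"),
    ("CELL_A", "AAGT", "GeneX"),  # technical duplicate of the read above
    ("CELL_A", "CCTA", "GeneX"),  # same gene, different UMI: kept
    ("CELL_B", "AAGT", "GeneX"),  # same UMI but another cell: kept
    ("CELL_B", "GGAC", "GeneY"),
    ("CELL_B", "GGAC", "GeneY"),  # technical duplicate
]

dem = digital_expression_matrix(records)
print(dem)
```

Cells with no molecule for a gene are simply absent from the inner dictionary, which corresponds to a zero in the matrix.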
34:32And all sequence data
34:34always arrive, as in bulk RNA-seq,
34:36in the FASTQ format.
34:38So that's how you receive your sequences.
34:41One of the steps,
34:42in order to quantify the expression,
34:44is that you align not the UMI and the barcode,
34:47because those are only technical,
34:50but
34:51you align the read corresponding
34:52to your cDNA.
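To make the split between technical and biological sequence concrete, here is a minimal sketch that cuts a technical read into cell barcode and UMI. The lengths used (16 bases of barcode followed by 12 bases of UMI) are an assumption that matches some droplet-based chemistries; they differ between protocols and versions, so check the one you use. The sequences are invented.

```python
BARCODE_LEN = 16  # assumed cell barcode length, protocol dependent
UMI_LEN = 12      # assumed UMI length, protocol dependent

def split_technical_read(read1_seq):
    """Return (cell_barcode, umi) from the technical read.

    Neither part is aligned to the genome; both are only used to
    assign the cDNA read to a cell and to an original molecule.
    """
    if len(read1_seq) < BARCODE_LEN + UMI_LEN:
        raise ValueError("read is shorter than barcode + UMI")
    barcode = read1_seq[:BARCODE_LEN]
    umi = read1_seq[BARCODE_LEN:BARCODE_LEN + UMI_LEN]
    return barcode, umi

# Hypothetical read pair: read 1 is technical, read 2 is the cDNA.
read1 = "ACGTACGTACGTACGT" + "TTGCATTGCATT"
read2 = "GATTACAGATTACAGATTACA"  # only this sequence goes to the aligner

barcode, umi = split_technical_read(read1)
print(barcode, umi)
```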
34:54The alignment tools for single-cell
34:56RNA-seq are almost all the
34:59same as the ones used for bulk
35:02RNA-seq.
35:02So here again,
35:04you see that there are multiple
35:06options and multiple alignment
35:07tools for different applications.
35:09So for example,
35:10here you see the years of publication.
35:13It's not really updated.
35:15The methods that you see in red
35:17are the ones that were developed
35:20specifically for RNA-seq,
35:21and STAR, which you see here,
35:24was released in 2012.
35:27So almost ten years ago. It is one of the
35:30standard alignment tools for bulk
35:32RNA-seq; we saw this with Everett,
35:35I think two weeks ago.
35:37It's also the most common tool,
35:40the default tool in many single-cell
35:42pipelines.
35:43So almost all of them will use
35:46STAR or HISAT or another
35:48RNA-seq
35:50splice-aware aligner tool in order
35:51to perform the alignment,
35:53so this is not so different
35:56from bulk RNA-seq.
35:57Also,
35:58the alignment output that you
36:00receive will be a BAM file.
36:03This is a file where each original
36:05read contains the information
36:07on the alignment, so
36:09this file contains the coordinates,
36:11so the chromosome and the genomic
36:13coordinate of the alignment,
36:14and you need to use this file
36:17in order to calculate the number
36:19of reads that map to each gene
36:21or each transcript.
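The counting step can be illustrated without reading a real BAM file. The sketch below assumes the alignments have already been reduced to (chromosome, start, end) tuples and uses a tiny invented annotation; in practice this step is done by dedicated counting tools working on the BAM file and a gene annotation, and strand and splicing would also have to be considered.

```python
# Hypothetical annotation: gene -> (chromosome, start, end), 1-based inclusive.
genes = {
    "GeneX": ("chr1", 1000, 2000),
    "GeneY": ("chr1", 5000, 6000),
}

# Hypothetical alignments taken from a BAM file: (chromosome, start, end).
alignments = [
    ("chr1", 1100, 1190),  # inside GeneX
    ("chr1", 1950, 2040),  # partially overlaps GeneX
    ("chr1", 5500, 5590),  # inside GeneY
    ("chr2", 1100, 1190),  # no annotated gene on this chromosome
]

def count_reads_per_gene(alignments, genes):
    """Count alignments that overlap exactly one annotated gene."""
    counts = {gene: 0 for gene in genes}
    for chrom, start, end in alignments:
        hits = [
            gene
            for gene, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start <= g_end and end >= g_start
        ]
        # Reads overlapping no gene or several genes are left uncounted.
        if len(hits) == 1:
            counts[hits[0]] += 1
    return counts

gene_counts = count_reads_per_gene(alignments, genes)
print(gene_counts)
```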
36:23So this is also not different
36:25from bulk RNA-seq.
36:27The only difference is that, for
36:29example, you can have a different
36:31BAM file for each cell
36:34instead of only one BAM file.
36:38OK, so what we covered so far is the
36:41first data preprocessing, so again we
36:43have the cell barcode, the UMI and the RNA.
36:46You cluster
36:48reads according to the cell,
36:51you remove technical duplicates and you
36:53arrive at your gene expression matrix,
36:55your digital expression matrix.
36:56Now a big difference between the count
36:59data that you can obtain in bulk versus
37:01single cell is what you see here.
37:03So this is a typical count matrix from
37:06bulk RNA-seq and you can see the numbers
37:09are very high and rarely you see zero values.
37:13In single-cell RNA-seq,
37:15this is what you obtain most of the time.
37:18I would say this is a very good one;
37:22it is an example with a low amount of zeros.
37:25So what you can see is that the numbers are
37:29lower and most of the values are zeros.
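The difference between the two matrices can be summarized by the fraction of zero entries. The two small matrices below are invented purely to illustrate the calculation (rows are genes, columns are samples or cells); they are not taken from the slides.

```python
def zero_fraction(matrix):
    """Fraction of entries equal to zero in a list-of-rows count matrix."""
    values = [value for row in matrix for value in row]
    return sum(1 for value in values if value == 0) / len(values)

# Toy bulk-like matrix: high counts, zeros are rare.
bulk_like = [
    [1520, 1334, 1789],
    [230, 0, 310],
    [87, 95, 120],
    [4021, 3890, 4500],
]

# Toy single-cell-like matrix: low counts, most entries are zero.
single_cell_like = [
    [2, 0, 1],
    [0, 0, 0],
    [0, 1, 0],
    [5, 3, 0],
]

bulk_zeros = zero_fraction(bulk_like)
single_cell_zeros = zero_fraction(single_cell_like)
print(f"bulk: {bulk_zeros:.2f}, single cell: {single_cell_zeros:.2f}")
```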
37:32So the fact that you have lower counts
37:35means that in all your analyses
37:37you will have a higher contribution
37:40of noise and this will bring you
37:42to a higher uncertainty in results.
37:45And this is a big problem:
37:48it means that when you choose
37:50among different pipelines,
37:51you will have very different results.
37:54But the origin of this is that
37:59your original values,
38:00your original quantification of expression
38:02values, were generally very low and so
38:05the contribution of noise is higher.
38:07So this problem is one of the main
38:10problems in single-cell RNA-seq and
38:12at the moment is kind of unavoidable,
38:15I would say. The second problem
38:18is that you have several zeros.
38:21And some of these zeros are
38:24real zeros, meaning that in that
38:26cell the gene is not expressed,
38:29so these correspond to true
38:31biological zeros: they represent
38:32the true lack of expression.
38:34But many times the zeros represent
38:36a technical lack of detection,
38:38meaning that the gene was
38:40present in your cell,
38:41but it was not detected because
38:44it was not captured by your bead,
38:47and so you don't have a way to see
38:49your gene because you didn't detect
38:52it in your library preparation.
38:54And obviously,
38:55in methods where you have a
38:57low coverage inside each cell,
39:00the probability of this technical
39:02lack of detection,
39:03which is also called the dropout
39:05effect, is very high,
39:07so I think in the 10x approaches
39:10you can expect your genes to be
39:13not detected with 80% probability.
39:16Obviously this depends also on
39:18whether the gene is highly expressed
39:20or has a low expression level.
39:22So if a gene has a high expression level,
39:25the probability of being detected at least
39:27with one molecule is higher,
39:29but genes with low expression levels,
39:31for example transcription factors,
39:32will rarely be detected,
39:34and most of the time it will be
39:36because of a detection problem,
39:38not because they are not expressed
39:40in the cell.
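The dependence of dropout on expression level can be illustrated with a very simple model. It assumes that each mRNA molecule in the cell is captured independently with the same probability; both this independence and the capture rate of 10% used below are assumptions for illustration, not measured values.

```python
def detection_probability(n_molecules, capture_rate):
    """Probability that at least one of n molecules is captured.

    Assumes each molecule is captured independently with the same
    probability, so P(detected) = 1 - (1 - capture_rate) ** n.
    """
    return 1.0 - (1.0 - capture_rate) ** n_molecules

CAPTURE_RATE = 0.10  # hypothetical per-molecule capture probability

# A lowly expressed gene (e.g. a transcription factor) versus a highly
# expressed gene, with made-up molecule numbers per cell.
p_low = detection_probability(2, CAPTURE_RATE)
p_high = detection_probability(50, CAPTURE_RATE)
print(f"2 molecules: {p_low:.2f}, 50 molecules: {p_high:.2f}")
```

Under this model a gene present with only a couple of molecules is missed in most cells even though it is expressed, which is the dropout effect described above.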
39:41And this is also an inherent problem
39:43with single-cell data analysis.
39:45So it's very important to.
39:48Doodle
39:51uh no.
39:55No, yes, so now we wouldn't be
39:58ready at the end of the time.
40:01So one thing we could do is I could
40:04continue and finish the next time
40:09with the remaining analysis
40:11steps. Sure, yeah, I think, Tom,
40:12it's your own judgment
40:14how you want to proceed.
40:15If you think it is a natural stop,
40:17then you can stop; if you think
40:19you want to cover 5 more minutes,
40:21go ahead and do that.
40:24I can give you a sort of anticipation
40:27of the following steps.
40:29So
40:33for many of these steps I
40:36took inspiration from this review that
40:39was recently published in Nature Methods.
40:42It covers the main triumphs,
40:44so the successes, and also the
40:47limitations of the computational
40:49methods for single-cell RNA-seq
40:51analysis, and so next time we'll
40:54cover the key preprocessing steps.
40:56We have seen the molecular counting,
40:59but we will see
41:01how we can do quality control:
41:04remove cells that are suspicious,
41:06for example because they are dying or
41:09because they represent empty droplets
41:12or because they represent doublets.
41:14So doublets occur when you didn't really
41:17manage to physically separate the cells.
41:19So for example in the
41:22same droplet you have two cells,
41:24or for some reason the cell barcode
41:27was shared by two different cells
41:30because of some technical problem,
41:32so we'll see methods to remove
41:34dying cells and doublets. Then there
41:37are problems related to the normalization.
41:40So how to consider
41:42the fact that you have a different
41:45number of reads in
41:47different cells, and this is a problem
41:50because, biologically speaking, you
41:52expect cells of different types to
41:55have a different amount of RNA,
41:58so you expect some cells to have more
42:00RNA molecules than other cells,
42:03but most of the methods
42:05assume that you need to
42:08have from each cell the same number of
42:11reads or UMIs. And then we will see how.
42:14So how to remove genes that are
42:17not important in the analysis.
42:19This is very important because, as
42:20you can imagine, in single cell you
42:23have thousands of genes and also
42:25thousands of cells,
42:26so your count matrix is very
42:29high-dimensional and so all of these methods
42:31try to reduce the number of cells,
42:33keeping only the high-quality
42:35cells, but also the number of genes,
42:38so reducing the dimensionality
42:39of your data.
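As a preview of two of these steps, the sketch below shows one very simple form of normalization (scaling each cell to the same total) and one very simple gene filter (dropping genes seen in too few cells). The function names, the scale of 10,000 and the threshold are assumptions for illustration. Note that the scaling step carries exactly the assumption mentioned above, that every cell should have the same total number of molecules.

```python
def normalize_by_library_size(counts, scale=10_000):
    """Scale each cell so that its counts sum to `scale`.

    counts: dict cell -> dict gene -> UMI count.
    Cells with a total of zero are returned unchanged.
    """
    normalized = {}
    for cell, gene_counts in counts.items():
        total = sum(gene_counts.values())
        if total == 0:
            normalized[cell] = dict(gene_counts)
        else:
            normalized[cell] = {
                gene: count * scale / total
                for gene, count in gene_counts.items()
            }
    return normalized

def genes_detected_in(counts, min_cells=1):
    """Return the sorted genes with a non-zero count in at least min_cells cells."""
    n_cells = {}
    for gene_counts in counts.values():
        for gene, count in gene_counts.items():
            if count > 0:
                n_cells[gene] = n_cells.get(gene, 0) + 1
    return sorted(gene for gene, n in n_cells.items() if n >= min_cells)

# Toy data: CELL_B has five times more molecules than CELL_A,
# but the same relative composition.
counts = {
    "CELL_A": {"GeneX": 8, "GeneY": 2, "GeneZ": 0},
    "CELL_B": {"GeneX": 40, "GeneY": 10, "GeneZ": 0},
}

normalized = normalize_by_library_size(counts)
kept_genes = genes_detected_in(counts, min_cells=1)
print(normalized["CELL_A"], kept_genes)
```

After scaling, the two cells look identical, so any real difference in total RNA content between them is no longer visible.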
42:46Leash.
42:48That sounds amazing. I look forward to
42:51next week. Yeah. And also next week
42:53we will see the downstream analysis,
42:55for example clustering for single-cell
42:58approaches and
43:01possibly also trajectory estimation.
43:05Yeah, thanks so much, and another thing I really
43:07want to comment is that your slides
43:09look very beautiful, and I wish all my
43:12slides, and possibly other people's in my lab
43:14as well, looked as nice when
43:17I try to put a course
43:19on this, so
43:21maybe we should use your slides as a template.