Analysis and Interpretation of single cells sequencing data – part 1 Introduction and alignment
August 25, 2021
Information
- ID
- 6875
- To Cite
- DCA Citation Guide
Transcript
- 00:00Because now the main focus of today
- 00:02will be the analysis and interpretation
- 00:05of single cell sequencing data.
- 00:07So we won't cover everything today
- 00:09and so it will take at least
- 00:14another meeting to cover everything.
- 00:16But today we will cover the
- 00:18introduction on the methodologies,
- 00:20some technical and experimental issues,
- 00:22and some issues also
- 00:24with the analysis of this data.
- 00:27So single cell analysis, as a
- 00:29definition, is the study of omics,
- 00:31at least that's what we're speaking about
- 00:34today. It is the study of omics, so
- 00:37genomics, transcriptomics, proteomics,
- 00:39at the single cell level.
- 00:40So the advantage is that this
- 00:43family of methods allows you to capture
- 00:46the cellular diversity of tissues
- 00:49with single cell resolution.
- 00:52So the field is bursting, it
- 00:55is like exploding, with a number
- 00:59of novel experimental techniques every year.
- 01:02But there are also many
- 01:04computational challenges,
- 01:05so these methods,
- 01:06the single cell methods, require the
- 01:09development of appropriate analyses.
- 01:10And so we will see that
- 01:14common workflows employ
- 01:16some generic methods,
- 01:17for example the clustering analysis that
- 01:20June spoke about in our first meeting.
- 01:23Some of the methods for the normalization,
- 01:26for example,
- 01:27or for the calculation of differential gene
- 01:31expression are taken from bulk RNA-seq,
- 01:34but it is not always the best choice.
- 01:38And since the field is rapidly moving,
- 01:41there is no gold standard, I would say,
- 01:45in any step of the analysis,
- 01:47so you will find a lot of methods,
- 01:50a lot of applications.
- 01:52You can find in the literature comparisons,
- 01:54for each step of the analysis,
- 01:56of alternative approaches,
- 01:57but there is no gold reference
- 02:00that you can choose.
- 02:02For example,
- 02:03there are sort of gold pipelines in
- 02:06bulk RNA-seq, and in single cell
- 02:09it's not so established.
- 02:13This is a comparison of the
- 02:15methods, single cell versus bulk.
- 02:16So in the bulk analysis you take a tissue
- 02:19or population of cells and you extract
- 02:21RNA from the whole population, so that
- 02:24you mix up the RNA content in the same,
- 02:29yeah, in the same container, let's say,
- 02:31and then you prepare the library
- 02:34and you sequence RNA
- 02:37from the whole population of cells.
- 02:40This means that you get, for each library, from
- 02:43each collection of cells, from each tissue,
- 02:46only one measurement.
- 02:47And this measurement of genes represents
- 02:50the average expression of these genes
- 02:53across all the cells of your tissue.
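As a rough illustration of the point about averaging, the sketch below pools a made-up population of six cells into one bulk-style profile. The cell names, gene names, and counts are invented for the example and do not come from the talk.

```python
# Hypothetical toy data: per-cell expression of two genes.
# Three cells express only geneA, three express only geneB.
cells = {
    "cell_1": {"geneA": 10, "geneB": 0},
    "cell_2": {"geneA": 12, "geneB": 0},
    "cell_3": {"geneA": 8, "geneB": 0},
    "cell_4": {"geneA": 0, "geneB": 9},
    "cell_5": {"geneA": 0, "geneB": 11},
    "cell_6": {"geneA": 0, "geneB": 10},
}
genes = ["geneA", "geneB"]

# A bulk library pools the RNA of all cells, so each gene gets one value:
# here, the mean over the population.
bulk_profile = {
    gene: sum(profile[gene] for profile in cells.values()) / len(cells)
    for gene in genes
}

# How many cells actually express each gene (information lost in bulk).
expressing_cells = {
    gene: sum(1 for profile in cells.values() if profile[gene] > 0)
    for gene in genes
}

print(bulk_profile)
print(expressing_cells)
```

In this toy case the bulk profile shows both genes at the same moderate level, although no individual cell expresses both.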
- 02:56So obviously you cannot use the bulk
- 02:58method if you want, for example,
- 03:01to see the cellular heterogeneity in your
- 03:04tissue. With a single cell analysis, yeah,
- 03:07you first perform a step that
- 03:10is the isolation of the cells.
- 03:12So this is kind of tricky,
- 03:14especially in solid tissues, because you
- 03:17need to mechanically separate each cell.
- 03:19It's easier with liquid tissues.
- 03:22It's easiest, for example,
- 03:23when you consider the analysis of
- 03:26hematopoietic cells. And so inside each
- 03:28single cell you perform the
- 03:30quantification of gene expression,
- 03:32because you have a way to create a
- 03:35library where you can keep track
- 03:38of the cell of origin of each RNA,
- 03:41and so that's why
- 03:42you can then quantify, for each gene,
- 03:45the expression in each single cell,
- 03:48so that each cell has a distinct
- 03:51expression profile.
- 03:51For example,
- 03:52this cell expresses only one gene.
- 03:54These other cells express
- 03:57multiple different genes and in
- 03:59different amounts, and so
- 04:01you can use
- 04:02this difference in expression
- 04:05between different cells in order
- 04:08to see how similar cells are
- 04:10to each other, or how different.
- 04:12So for example,
- 04:13you can perform clustering analysis of
- 04:16cells based on their expression profiles
- 04:18and also other downstream analyses.
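One simple way to put a number on how similar two cells are is the cosine similarity between their expression profiles. This is only a sketch: the talk does not name a specific similarity measure, and the three profiles below are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two expression vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical profiles over three genes (same gene order in every cell).
profiles = {
    "cell_1": [10, 0, 0],
    "cell_2": [8, 1, 0],
    "cell_3": [0, 5, 9],
}

sim_1_2 = cosine_similarity(profiles["cell_1"], profiles["cell_2"])
sim_1_3 = cosine_similarity(profiles["cell_1"], profiles["cell_3"])

print(round(sim_1_2, 3), round(sim_1_3, 3))
```

Cells 1 and 2 come out as nearly identical, while cells 1 and 3 share no expressed gene. A clustering method would use pairwise values of this kind to group cells.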
- 04:21So obviously you have richer data
- 04:25that you can look at,
- 04:28and you have many more options
- 04:32in the final analysis.
- 04:36So yes,
- 04:36when it was launched there
- 04:39was this kind of comparison between
- 04:41bulk and single cell RNA-seq: that bulk
- 04:43analysis is like the analysis of a fruit
- 04:46smoothie, and the single cell
- 04:48is like the analysis of a fruit
- 04:51salad, where you can distinguish the
- 04:53contribution of each fruit, where each
- 04:56fruit is a different cell type or subtype.
- 04:59Now the main applications for single
- 05:01cell RNA sequencing, when we're speaking
- 05:04about discrimination among different cells.
- 05:07There are multiple. I divided
- 05:09these into two branches,
- 05:11so one is the so-called
- 05:14discrete analysis.
- 05:15So you
- 05:17collect the expression,
- 05:18the abundance of transcripts of genes
- 05:21inside each cell, and you want to
- 05:23cluster cells in order to identify
- 05:25different cell types,
- 05:26for example
- 05:27the cell types that compose the
- 05:29tissue that you're studying.
- 05:31So this is a discrete analysis because
- 05:34you are assuming that your tissue is
- 05:36composed of different types of cells
- 05:39that are clearly
- 05:40distinguishable from each other,
- 05:42and so this analysis has,
- 05:44for example, something to do
- 05:46with clustering,
- 05:49because ultimately you want to
- 05:51identify separate clusters of
- 05:53cells based on their expression
- 05:55profile. A question.
- 05:57A quick question: this is super
- 05:59relevant to hematopoiesis,
- 06:01so there's even controversy.
- 06:03I don't know that it should
- 06:05be a controversy about whether
- 06:08there are discrete cell states.
- 06:10Versus everything being continuous
- 06:12and logic tells me that there's
- 06:15going to be a continuous change
- 06:18in a bazillion different genes,
- 06:20because every cell is going
- 06:22to be slightly different.
- 06:24So do you have to actually ask
- 06:27the algorithm to analyze the
- 06:29data to find discrete sets versus
- 06:32find a continuous analysis?
- 06:36So I personally don't know if there is a way,
- 06:41if there is a tool, so I never used a
- 06:44tool that tells you if the best analysis
- 06:47is discrete or continuous. OK, I think
- 06:50that probably if we looked at different
- 06:52papers where they claim it's discrete
- 06:54versus claiming it's continuous
- 06:56that we would find differences in
- 06:58how they analyzed it, yes,
- 07:00so the a priori knowledge of the
- 07:02sample is something that you can use.
- 07:05And for example, if you take
- 07:07peripheral blood,
- 07:09if you take single cell data sets
- 07:11of peripheral blood, where
- 07:13most of the cells are mature
- 07:15and already differentiated,
- 07:16then you see clearly that you have
- 07:19very separated, discrete clusters,
- 07:21and so it makes more sense to perform
- 07:25a discrete analysis or clustering
- 07:27analysis. If you take bone marrow
- 07:30or a population that is enriched
- 07:32for stem cells or progenitors,
- 07:34then you expect to have a more continuous
- 07:39representation of your sample.
- 07:41And so this is important because,
- 07:43whatever tool you use,
- 07:46any clustering analysis will
- 07:47give you clusters, and any
- 07:50continuous analysis, such as
- 07:52inference of trajectory,
- 07:53will find a trajectory.
- 07:55So if you submit your sample to any analysis,
- 07:59you will obtain a result,
- 08:01but the result can be meaningless.
- 08:04For example, a continuous analysis
- 08:06can be meaningless if
- 08:09your sample is biologically not,
- 08:11um,
- 08:12for example,
- 08:13something that is differentiating
- 08:15or developing.
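The warning that an algorithm always returns something can be shown with a tiny, deliberately naive k-means. This is a sketch written for this purpose and not a tool mentioned in the talk: it runs on evenly spaced one-dimensional values with no group structure at all, and the function name and data are made up.

```python
def kmeans_1d(values, k=2, n_iter=20):
    """Naive 1-D k-means with deterministic starting centers (requires k >= 2)."""
    lo, hi = min(values), max(values)
    # Spread the starting centers evenly over the data range.
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = []
    for _ in range(n_iter):
        # Assign every value to its nearest center.
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Move every center to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels, centers


# Evenly spaced values from 0.0 to 1.0: there are no real groups here.
uniform = [i / 10 for i in range(11)]
labels, centers = kmeans_1d(uniform, k=2)

print(labels)
print(centers)
```

The function still reports two "clusters", simply because it was asked for two. Whether they mean anything has to be judged from knowledge of the sample, which is the point being made above.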
- 08:19So yeah, yeah, so yeah.
- 08:22And that's that's the parallel.
- 08:24So whenever you see
- 08:29courses or tutorials on continuous analysis,
- 08:31that's something you need to be careful about.
- 08:33You will always get a graph.
- 08:35You will always get a sort
- 08:38of differentiation tree,
- 08:39but you have to be careful because
- 08:42sometimes it doesn't make sense
- 08:46to do the analysis at all.
- 08:49Because
- 08:50one of the assumptions of a
- 08:52continuous analysis is that you
- 08:54have a sampling of the continuous
- 08:56process that you're trying to model,
- 08:58for example development or differentiation.
- 09:00If this is not true,
- 09:01you don't have a basis
- 09:03to do the analysis at all.
- 09:07So Tommaso, so I think this
- 09:10is a very important question that
- 09:12I raised, because, you know, in real
- 09:15life situations we will get samples
- 09:17sequenced, and how do we tell if
- 09:20this is reasonable or not reasonable?
- 09:22So I just wonder if anyone has done
- 09:25a very careful analysis to sort of,
- 09:27you know, have some ground truth. For example,
- 09:30you have two discrete cell states that you
- 09:32already isolated or somehow maintained, and you
- 09:34put them into single cell sequencing,
- 09:37and then you force it to assume a
- 09:40trajectory based methodology. Does that
- 09:41cause major artifacts or not?
- 09:43I think that's one of the ways
- 09:46to think about it.
- 09:48Yeah, so I'm not aware, meaning
- 09:51that I never used tools that
- 09:55explicitly tell you which
- 09:58branch of the analysis is better.
- 10:01So by exploratory analysis,
- 10:02for example, we will see that when
- 10:06you do the preprocessing and then the
- 10:09dimensionality reduction and you have
- 10:12a plot, like on a hyperplane, of cells,
- 10:16you can guess,
- 10:18depending on the structure of your sample,
- 10:21whether it's more reasonable to
- 10:23proceed with the discrete clustering or
- 10:25to perform a trajectory analysis, or both.
- 10:27So sometimes, for example,
- 10:30it could make sense to start with an
- 10:33exploratory analysis on all the cells.
- 10:36I take this as an example because
- 10:38it's on this slide.
- 10:40So this seems to be a separate
- 10:43cluster of cells.
- 10:45It could be reasonable,
- 10:46then, to select only this
- 10:48cluster. Within this cluster
- 10:50we don't see clear subclusters,
- 10:51so within this cluster
- 10:54it could make sense to perform
- 10:56a trajectory analysis to see if
- 10:58there is a continuous process,
- 11:00but not at the beginning,
- 11:02taking into consideration
- 11:03also these two clusters here,
- 11:05because they're clearly separated.
- 11:07So sometimes I think
- 11:09the workflow can also be mixed.
- 11:12So you start with all the cells,
- 11:14you remove clear outlier clusters,
- 11:16maybe you annotate the clusters
- 11:18so that you know,
- 11:20for example,
- 11:20that inside your population you have a
- 11:23mixture of progenitors or stem cells,
- 11:25and inside that cluster
- 11:28you perform the trajectory analysis.
- 11:33I see. So that would be
- 11:35my, yes, tentative answer.
- 11:36Now I don't know if anyone
- 11:38else has other suggestions.
- 11:46That maybe we can leave.
- 11:48I can find some material
- 11:49for next time to see if I can
- 11:52answer more extensively.
- 11:55I think it's a tough question
- 11:56because I don't think there's
- 11:57a consensus necessarily in the
- 11:59field, and people just see,
- 12:00OK, if it makes sense or it doesn't
- 12:02make sense to their own eyes.
- 12:05Yep. Yeah, again, I don't know
- 12:07if someone is trying to build
- 12:09some tools that, you know,
- 12:12kind of quantify the
- 12:16reasonableness of each
- 12:18of the approaches. Yeah.
- 12:23But yes, it's an important distinction.
- 12:26Also, uh, yes. Also also later, 'cause it.
- 12:30Um, some history.
- 12:32So this is the first publication on
- 12:34single cell sequencing, so it's from 20,
- 12:37no, sorry, 12 years ago.
- 12:39So it was an RNA-seq covering
- 12:42the whole transcriptome of a single
- 12:45cell. So it was really one single cell,
- 12:48because it was a mouse blastomere that
- 12:51was isolated with a microscope, so it
- 12:54was manually picked under
- 12:56a microscope, then lysed and then
- 12:58sequenced, and together with the blastomere
- 13:01oocytes were also analyzed.
- 13:03And so basically the trick here
- 13:06to reach single cell resolution
- 13:09was the isolation of these cells,
- 13:11and then the procedure was
- 13:14standard lysis and then library
- 13:16preparation as in bulk RNA-seq,
- 13:19but starting from one cell.
- 13:23So from that, the field,
- 13:26as I told you,
- 13:27exploded, and so in this plot here,
- 13:31so this is from a review from 2018,
- 13:34so it was ten years after this
- 13:37first publication,
- 13:38what you can see is the
- 13:40release of multiple approaches
- 13:42for single cell RNA-seq
- 13:44that increase the number
- 13:46of cells that you can study.
- 13:48So obviously that one alone
- 13:50was a proof of concept,
- 13:52but the real
- 13:54single cell explosion happened
- 13:55when you were
- 13:58able to parallelize the process,
- 13:59so when you were able to capture
- 14:01the single cell expression level of
- 14:03first hundreds, and then thousands,
- 14:05and then millions of cells.
- 14:07So here you see the publication date of
- 14:10the techniques and
- 14:12the number of single cells
- 14:14that were analyzed.
- 14:15So this is the first, with only one
- 14:17cell, and then you see that the trend
- 14:20is to release techniques that allow
- 14:22you to increase the throughput
- 14:24in terms of the number of
- 14:27cells that you can quantify,
- 14:29that you can consider in each experiment.
- 14:33Question, yeah, I don't know if
- 14:35you're going to get to this,
- 14:36so if you are, just say never mind.
- 14:40As the number of cells that are
- 14:42being sequenced has increased,
- 14:44the number of reads per cell that
- 14:46people get and report on has decreased,
- 14:49and I'd like to understand is
- 14:51that just because that's what's
- 14:53convenient in terms of putting
- 14:54it onto an Illumina sequencer,
- 14:56or is there something about the various
- 14:59techniques where you reach the limit
- 15:01of your detection after X number of
- 15:04reads and it's not worth getting more?
- 15:08Yes, so there is a tradeoff
- 15:10between these two parameters.
- 15:11One is the number of cells that you consider
- 15:14and the other is the number of reads
- 15:17that you obtain for each cell.
- 15:20The trend for the techniques has been
- 15:23mainly to increase the number of cells,
- 15:27and obviously this
- 15:29is at the expense of the number
- 15:31of reads for each cell.
- 15:34So for example,
- 15:38let's say that the field and
- 15:40the most popular techniques,
- 15:42for example 10x, have been
- 15:44increasing the number of cells more
- 15:46than the number of reads for each cell.
- 15:50This depends on the
- 15:52application of the method, I guess.
- 15:54So obviously if you're interested in
- 15:57the cell as your unit of interest,
- 15:59so if you're interested in
- 16:01more cellular biology,
- 16:03you're interested more in capturing
- 16:05cells and separating cells,
- 16:06you're not so interested in looking
- 16:08in great detail at the thousands
- 16:10of genes that are expressed within
- 16:13each cell. On the other side,
- 16:15if you're more interested,
- 16:16for example, in the molecular
- 16:18biology rather than
- 16:20just separating cells, then it would
- 16:22be more interesting to increase the
- 16:25depth of the sequencing in each cell.
- 16:29There are techniques where this is
- 16:32maximized, and obviously the tradeoff
- 16:35is that you cannot get as many cells
- 16:38as with the other methods.
- 16:42Um,
- 16:45I think I'm
- 16:47just asking also, and maybe June
- 16:49knows: is there a maximum number
- 16:50of reads you want per cell?
- 16:53Because after that you don't
- 16:54get any additional information.
- 16:57So we will see that we can measure when
- 17:00you reach the plateau,
- 17:03when you reach the plateau of the
- 17:07sequencing, using tricks such as the UMI.
- 17:10So if you append to each
- 17:12read a random barcode,
- 17:14you can see when it doesn't make sense
- 17:17to sequence at more depth, because all the
- 17:20additional reads that you are detecting
- 17:22are PCR duplicates of what
- 17:25you already sequenced. Right,
- 17:27got it. Yeah, I
- 17:28agree with Tommaso.
- 17:29So I think there are some studies that
- 17:32did those kinds of things before,
- 17:33but they were using different
- 17:35technologies compared to what you're
- 17:37going to use probably right now,
- 17:38so I'm not sure whether for every
- 17:40single technology out there
- 17:42there has already been a paper published.
- 17:44Maybe for 10x there's already a paper
- 17:45published on the standard procedures,
- 17:47but in your own data you can actually
- 17:49analyze it yourself to see whether
- 17:51it's approaching saturation or not,
- 17:53and you can sequence more from the
- 17:55same library if you want to.
- 17:57Yeah. Yes, yes, using UMIs
- 18:00you can measure that.
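A minimal sketch of how saturation could be checked in your own data with UMIs. It assumes each read has already been reduced to a (cell barcode, gene, UMI) triple and that the reads are in random order, so a prefix of the list behaves like a shallower sequencing run. Saturation is taken here as one minus the fraction of reads that are unique molecules, which is one common definition and an assumption of this example. The reads are invented.

```python
def saturation(reads):
    """1 - (unique molecules / total reads) for (cell_barcode, gene, umi) tuples."""
    if not reads:
        return 0.0
    return 1 - len(set(reads)) / len(reads)


# Hypothetical reads, already assigned to a cell barcode, a gene and a UMI.
reads = [
    ("AAAC", "Actb", "GGT"),
    ("AAAC", "Actb", "CCA"),
    ("AAAC", "Gapdh", "TTG"),
    ("AAAC", "Actb", "GGT"),   # same molecule seen again: PCR duplicate
    ("TTTG", "Actb", "GGT"),   # same UMI but a different cell: new molecule
    ("AAAC", "Actb", "GGT"),
    ("TTTG", "Actb", "GGT"),
    ("AAAC", "Gapdh", "TTG"),
]

# Treat prefixes of the read list as shallower sequencing depths.
curve = {depth: saturation(reads[:depth]) for depth in (2, 4, 8)}
print(curve)
```

When the value stops changing much as depth grows, additional reads are mostly duplicates of molecules that were already seen.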
- 18:02OK, I understand, thank you. Then
- 18:05the same techniques,
- 18:06for example 10x, at every
- 18:09release
- 18:13increase the detection of
- 18:15molecules inside each cell, so that
- 18:17the saturation limit is higher, and so
- 18:19it really depends
- 18:22on the technique and on the
- 18:24version of the technique itself.
- 18:28But in general, I would say that
- 18:30the number of cells that you can
- 18:33measure has increased more, on
- 18:35average across the techniques,
- 18:37than the depth,
- 18:40than the within-cell depth.
- 18:42With an exception that I
- 18:44show you in this slide,
- 18:47that is the Smart-seq
- 18:49family of techniques.
- 18:50So this family of techniques is
- 18:53the ideal family when you are not
- 18:56interested in capturing a lot of cells,
- 18:59but you want to maximize the analysis
- 19:02within each cell, and the advantage
- 19:05of these techniques is that
- 19:08you can have 1,000,000 reads for each cell.
- 19:11So it has a high coverage, and
- 19:14it's also one of the techniques that allow you
- 19:18to capture reads from the whole transcript.
- 19:21So we will see that the majority
- 19:23of commercial techniques,
- 19:25such as 10x,
- 19:28do not allow you to cover
- 19:30the full transcript,
- 19:31but they are three prime
- 19:32end or five prime
- 19:34end libraries.
- 19:35So that means that you can capture
- 19:37only the fragment that is near to the
- 19:39poly(A) for the three prime end, or near
- 19:41to the cap for the five prime end.
- 19:43This is one of the few methods where
- 19:46you can, as in bulk, as in most bulk
- 19:48RNA-seq, capture reads from
- 19:50the full transcript, and this is an
- 19:52advantage because, for example, if
- 19:54you want to do splicing analysis,
- 19:56that's the only way you can.
- 19:58You know,
- 20:01that's the only method you can use
- 20:04to perform a full
- 20:06splicing analysis. Otherwise,
- 20:07you can perform splicing analysis
- 20:09only on the initial exons at the
- 20:10five prime end or on the terminal
- 20:13exons. And also for allele analysis,
- 20:15so analysis of variations, analysis of
- 20:17SNPs from RNA-seq,
- 20:20if your mutation of interest,
- 20:22if your variation of interest, is
- 20:25inside the body of the gene and not
- 20:28at the five prime or the three prime end.
- 20:31So this has all the advantages of
- 20:34allowing an analysis within each cell
- 20:37that is comparable to bulk
- 20:39RNA-seq.
- 20:42It has a limitation that is shared
- 20:44with other techniques,
- 20:46which is that most of the single cell
- 20:48techniques right now allow you to
- 20:51detect only polyadenylated RNA,
- 20:53because they're based on poly(A) selection.
- 20:56And you have a low number of cells that
- 20:58you can sequence in each experiment,
- 21:01so less than 1000.
- 21:02Then Smart-seq already has three versions,
- 21:05so it was released first in 2012.
- 21:07Then there is Smart-seq2, released
- 21:09one year after, and then the
- 21:12latest is Smart-seq3, that was
- 21:14released last year.
- 21:16So each of these kind of increased
- 21:19the number of usable
- 21:22reads. Smart-seq2
- 21:23didn't allow the use of
- 21:26UMIs, but Smart-seq3 also allows the use of
- 21:30UMIs, and this is a comparison
- 21:32between the two versions,
- 21:34Smart-seq2 and Smart-seq3, where you
- 21:36see the box plot with the number of
- 21:40genes that are detected within each
- 21:42cell, and as you see, with Smart-seq3
- 21:45you can detect, for each cell,
- 21:47from 10,000 to 12,000 genes.
- 21:50And this number is also comparable
- 21:52to bulk RNA-seq. If you compare
- 21:55this number with other methods
- 21:57such as 10x, in 10x
- 21:59I think the average is 3 to 5000 genes for
- 22:03each cell when this value is high.
- 22:11OK, so this is an example of high coverage
- 22:14but low throughput.
- 22:15On the other hand, you have methods
- 22:18where you have low coverage inside
- 22:20each cell but high throughput, and
- 22:23a family of these methods are
- 22:25the so-called droplet based methods.
- 22:27One of the first that was
- 22:31released is the
- 22:35Drop-seq analysis,
- 22:37so the principle is to isolate
- 22:42single cells in single droplets,
- 22:44where you have your cell and you have a
- 22:47barcoded bead. The barcoded beads
- 22:49allow you to attach a cellular barcode
- 22:53that is unique for each bead,
- 22:56and so it's unique for each cell.
- 22:59And so that's the trick that is
- 23:02used in order to
- 23:04associate the content of each
- 23:06cell with a single barcode,
- 23:08that is the cell barcode.
- 23:12So it allows you to map thousands or tens of
- 23:16thousands of cells in the same experiment.
- 23:19So Drop-seq is
- 23:21only three prime end sequencing
- 23:24and it allows the use of
- 23:26UMIs, unique molecular identifiers.
- 23:28So I will have some slides later to
- 23:31show what is the meaning of that.
- 23:34This is the pipeline,
- 23:36the experimental pipeline
- 23:37of a Drop-seq experiment.
- 23:40So the principle is to isolate,
- 23:43at some point, in a droplet,
- 23:45one cell with one microparticle.
- 23:47Inside this droplet you have the capture of
- 23:50the polyadenylated RNA with a poly(T) probe,
- 23:53and then you have the retrotranscription,
- 23:56the reverse transcription, and the generation
- 23:58of the cDNA and the library preparation.
- 24:01This is kind of similar
- 24:04also to bulk RNA-seq approaches.
- 24:06Very similar to Drop-seq
- 24:08is also the 10x approach,
- 24:10that is the commercial
- 24:12development of Drop-seq,
- 24:14so in 10x
- 24:16you have the same strategy of
- 24:18dividing cells so
- 24:20that you have droplets in oil,
- 24:22in this case with a single cell and a
- 24:25single bead with the cellular barcode.
- 24:28And as you can see, the barcode attached
- 24:31to each bead has a standard adapter that
- 24:33you can use in Illumina sequencing,
- 24:36you have the cellular barcode,
- 24:38you have the UMI,
- 24:40and then you have a poly(T) probe
- 24:42that is used to capture poly(A) RNA.
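The bead oligo layout described above (cell barcode, then UMI, then poly(T)) means the barcode read can be split by position. The sketch below assumes a 16-nucleotide cell barcode followed by a 12-nucleotide UMI. Those lengths are an assumption for illustration, since they depend on the technique and its version, and the function name and example read are hypothetical.

```python
from typing import NamedTuple

CB_LEN = 16   # assumed cell barcode length (depends on the chemistry)
UMI_LEN = 12  # assumed UMI length (depends on the chemistry)


class ParsedRead(NamedTuple):
    cell_barcode: str
    umi: str


def parse_barcode_read(seq, cb_len=CB_LEN, umi_len=UMI_LEN):
    """Split the barcode read into cell barcode and UMI by position."""
    if len(seq) < cb_len + umi_len:
        raise ValueError("read too short to contain cell barcode and UMI")
    return ParsedRead(seq[:cb_len], seq[cb_len:cb_len + umi_len])


# Hypothetical read: cell barcode + UMI + start of the poly(T) stretch.
read1 = "ACGTACGTACGTACGT" + "TTGCAATTGCAA" + "TTTTTTTT"
parsed = parse_barcode_read(read1)
print(parsed)
```

The cDNA itself comes from the paired read, which is the part that gets aligned to the genome or transcriptome.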
- 24:47Um, this is to remind you
- 24:49that different platforms,
- 24:50according to different strategies, have
- 24:51different gene coverages. So Smart-seq2,
- 24:53that we saw before,
- 24:55has full coverage.
- 24:57So if you consider this as a meta gene,
- 25:00where we have the five prime UTR,
- 25:02the body of the gene,
- 25:03the coding sequence, and the three prime UTR,
- 25:06you have coverage of the full transcript,
- 25:08while with 10x or a three prime
- 25:10method you have enrichment only
- 25:12at the three prime end of the
- 25:14transcript. With the five prime method
- 25:16you have enrichment only
- 25:17at the five prime end,
- 25:19so you need to be careful about
- 25:21which library you are using,
- 25:22because if it is just
- 25:25for gene quantification
- 25:27the methods can be comparable,
- 25:28but if you are interested,
- 25:30for example, in isoform
- 25:32expression, splicing
- 25:33analysis and so on, only
- 25:35these methods allow you to
- 25:37perform a complete analysis,
- 25:38not these ones.
- 25:42And this is another plot
- 25:43comparing the gene coverage
- 25:45when you have full coverage.
- 25:46So this plot here is similar
- 25:48to plots that you could obtain
- 25:50from bulk RNA-seq.
- 25:52This method is three prime end,
- 25:54this is a three prime
- 25:56method, and so you see the
- 25:58enrichment at the three
- 25:59prime of the transcript.
- 26:04Now this was mainly for the technical part.
- 26:08Now the outlook on the computational
- 26:11analysis of single cell is
- 26:13summarized by this workflow.
- 26:15So most of the most popular methods
- 26:18allow you to generate libraries
- 26:20and the result will be reads
- 26:23sequenced with a standard platform
- 26:25such as Illumina. Some single cell
- 26:28methods have also been published
- 26:30that use full length sequencing.
- 26:32I think that probably
- 26:35in the future they will be
- 26:39used more,
- 26:40but right now the standard is
- 26:42to use short read sequencing
- 26:44coupled to single cell analysis. So
- 26:46we will see how raw data are obtained
- 26:49and how the raw reads can be
- 26:52transformed into count matrices
- 26:54that are similar to the count
- 26:56matrices of bulk RNA-seq,
- 26:59but instead of having samples
- 27:02and genes you have single cells and
- 27:04genes in your matrix, and the numbers
- 27:07correspond to the number of reads
- 27:09mapping to the gene in the cell.
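As a sketch of what such a matrix looks like, the snippet below builds a small genes-by-cells count table from a list of (cell barcode, gene) assignments, one per read. The barcodes, gene names, and the choice of genes as rows are illustrative assumptions. Real pipelines store these matrices in sparse formats because most entries are zero.

```python
from collections import Counter

# Hypothetical per-read assignments: (cell barcode, gene the read maps to).
assignments = [
    ("cell_1", "GeneA"),
    ("cell_1", "GeneA"),
    ("cell_1", "GeneB"),
    ("cell_2", "GeneB"),
    ("cell_2", "GeneB"),
    ("cell_2", "GeneB"),
]

counts = Counter(assignments)
cells = sorted({cell for cell, _ in assignments})
genes = sorted({gene for _, gene in assignments})

# Rows are genes, columns are cells; missing pairs count as zero.
matrix = [[counts[(cell, gene)] for cell in cells] for gene in genes]

print("genes:", genes)
print("cells:", cells)
for gene, row in zip(genes, matrix):
    print(gene, row)
```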
- 27:12Then there are quality control methodologies,
- 27:14visualization methodologies, clustering,
- 27:17identification of trajectories,
- 27:19so analyses that assume your
- 27:23sample, your population of cells,
- 27:26is continuous, and methods that assume
- 27:28that your population is discrete.
- 27:31So let's start from the beginning.
- 27:33So usually, in most of the methods,
- 27:35in the raw reads that you receive
- 27:38there are three important parts
- 27:39that you have,
- 27:40three parts of the sequence.
- 27:43And so the first important part
- 27:45is the cell barcode.
- 27:47So this is an oligonucleotide that
- 27:50can be 8 to 12 or more nucleotides long.
- 27:53This depends on the technique,
- 27:55and this sequence, the
- 27:58cell barcode, is unique for each of the beads,
- 28:01for example, that you used
- 28:04when your cell was
- 28:05in the droplet,
- 28:07so it's what you use to identify the cell,
- 28:10meaning that one of the first steps
- 28:12is to look at the region of the
- 28:14read that corresponds to the cell
- 28:16barcode and group together
- 28:18all the reads that have the same
- 28:20barcode, and that's what you see here.
- 28:22So all the reads with the same barcode,
- 28:25here in red, belong to cell one,
- 28:27because this is the cell barcode
- 28:29of this cell, and so on,
- 28:31so that all reads are grouped
- 28:33according to the value
- 28:34of the barcode.
- 28:36So obviously here there are some
- 28:39methodologies to account for possible errors
- 28:41in the sequencing of the barcode, so that
- 28:45barcodes are designed in a
- 28:47way that they differ at
- 28:49multiple nucleotides,
- 28:50so that if you make only one error
- 28:54you don't
- 28:54switch from one cell to another,
- 28:58but you need at least,
- 29:00for example,
- 29:01three errors in the sequencing
- 29:03of the barcode to assign
- 29:05the wrong cell to the read.
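The barcode design idea can be sketched with Hamming distances. If every pair of valid barcodes differs in at least three positions, an observed barcode with a single sequencing error is still closest to exactly one valid barcode and can be rescued. The whitelist below is made up, and the one-mismatch rule is an assumption chosen for the example, not a description of any particular tool.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def correct_barcode(observed, whitelist, max_mismatches=1):
    """Return the unique whitelist barcode within max_mismatches, else None."""
    hits = [bc for bc in whitelist if hamming(observed, bc) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


# Hypothetical whitelist of valid cell barcodes.
whitelist = ["AAAAAAAA", "CCCGGGTT", "GTGTACAC"]

# Smallest pairwise distance: how many errors it takes to turn
# one valid barcode into another.
min_distance = min(hamming(a, b) for a, b in combinations(whitelist, 2))

corrected = correct_barcode("AAAATAAA", whitelist)   # one error: rescued
discarded = correct_barcode("AAAATTAA", whitelist)   # two errors: not assigned

print(min_distance, corrected, discarded)
```

Reads whose barcode cannot be assigned unambiguously are simply left out in this sketch rather than being forced into a cell.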
- 29:10The second part that is
- 29:12important is the so-called
- 29:14UMI. So while
- 29:17the cell barcode is unique for each cell,
- 29:20the UMI is unique for each of the
- 29:23original molecules in your sample,
- 29:25and that's because, in the
- 29:27library preparation strategies,
- 29:28this is an oligonucleotide
- 29:31that is appended to the
- 29:35library during the cDNA
- 29:39generation, before
- 29:40the amplification steps,
- 29:42so before the PCR.
- 29:44So this means that
- 29:47this stretch,
- 29:50this random barcode, can be used to
- 29:53discriminate between PCR duplicates
- 29:54and real biological duplicates.
- 29:57So in RNA-seq you can expect to see
- 30:00two reads that are the same because
- 30:03they were derived from two copies,
- 30:06two different transcripts
- 30:07transcribed from the same gene.
- 30:09So some genes,
- 30:11such as the transcripts
- 30:12of ribosomal proteins,
- 30:14are expected to be in the
- 30:17range of 1,000 to
- 30:1910,000 copies in a single cell,
- 30:21so you can expect to have several
- 30:24molecules captured in your library,
- 30:26but these are true biological sequences,
- 30:29because at the origin you have
- 30:32distinct RNA molecules.
- 30:34This is different from PCR
- 30:37duplicates, because these are created
- 30:40during the amplification step.
- 30:42So that's why the
- 30:44UMIs
- 30:44are important, because this discrimination
- 30:46between biological duplicates and
- 30:48technical duplicates is really important
- 30:50when you perform many
- 30:51rounds of amplification,
- 30:52and this happens when you
- 30:54have a low input material,
- 30:56such as in some tricky libraries
- 30:58where you have a low amount
- 31:01of sample, and this is the case in
- 31:04single cell approaches because, again,
- 31:06you're starting from the amount of RNA
- 31:09that is extracted from one single cell.
- 31:11So you can imagine having
- 31:14a lot of amplification
- 31:16occurring in order to detect the gene.
- 31:19So this is the definition of the UMI:
- 31:22it is a randomized nucleotide sequence.
- 31:24Again, depending on the library preparation
- 31:27and on the technique that you use,
- 31:29it can be 8 nucleotides long or
- 31:3112 nucleotides long; the longer
- 31:34the better.
- 31:35It's incorporated into the cDNA in the
- 31:37initial steps of the RNA-seq protocol,
- 31:40so before the amplification step.
- 31:42So the goal of the UMI is to
- 31:44distinguish between amplified copies
- 31:46of the same RNA molecule,
- 31:48because these have the same cDNA sequence
- 31:51and they have the same
- 31:53UMI,
- 31:53so they are technical duplicates
- 31:55and they are removed.
- 31:57While
- 31:57what you want to keep is reads
- 31:59from separate RNA molecules
- 32:00transcribed from the same gene,
- 32:02because these will have the same
- 32:04cDNA but will have a different
- 32:06UMI,
- 32:06so these are biological duplicates
- 32:08and they are kept. So the
- 32:10UMI is a method to reduce
- 32:14the amplification noise.
- 32:15This is a graphical example
- 32:17of the importance,
- 32:18so this is an example where
- 32:19you have a reference sequence,
- 32:21so this is a region of a gene,
- 32:24for example, and in your experiment,
- 32:26for example in the same cell
- 32:27you get 10 reads
- 32:28with identical sequence,
- 32:30and so that they align to the same region.
- 32:33So if we assume that they are all
- 32:35PCR duplicates, we have to remove
- 32:37all of them and keep only one.
- 32:39And that means that when we calculate
- 32:42the abundance of the gene, if we don't
- 32:44remove the duplicates, we will say this
- 32:47gene has a count of 10.
- 32:50After deduplication we would say that
- 32:52the gene has a count of one, because
- 32:55by using this approach we assume that
- 32:58all the duplicates are PCR duplicates.
- 33:01If we include the UMIs,
- 33:03we can separate technical
- 33:05from biological duplicates.
- 33:06So we can use the UMI (here different
- 33:10UMIs are different colors) in
- 33:12order to group technical duplicates,
- 33:15for example,
- 33:15these four, these three and these two,
- 33:19but we keep the biological duplicates.
- 33:21So in the end,
- 33:23instead of collapsing everything,
- 33:24we can keep four reads because,
- 33:27having four different UMIs,
- 33:29they probably correspond to four different
- 33:31original molecules in our sample.
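The three ways of counting described in this example can be sketched in a few lines of Python. This is only an illustration: the gene name and the UMI strings are made up, the group sizes (4, 3, 2 and 1 reads) follow the slide as described, and real tools also have to handle sequencing errors inside the UMI, which this sketch ignores.

```python
from collections import Counter

# Hypothetical toy data: 10 reads from one cell with identical cDNA sequence,
# each paired with its UMI. Gene name and UMI strings are invented.
reads = (
    [("GENE_A", "AACGTTGA")] * 4
    + [("GENE_A", "CCGTAGCT")] * 3
    + [("GENE_A", "GTTACCAG")] * 2
    + [("GENE_A", "TTAGGCAC")] * 1
)


def count_raw(reads):
    """Count every read, duplicates included."""
    return Counter(gene for gene, _umi in reads)


def count_sequence_dedup(reads):
    """Assume all identical reads are PCR duplicates: one count per gene."""
    return Counter({gene for gene, _umi in reads})


def count_umi_dedup(reads):
    """Collapse reads that share gene and UMI, then count UMIs per gene."""
    unique_pairs = set(reads)
    return Counter(gene for gene, _umi in unique_pairs)


print("no deduplication:      ", count_raw(reads)["GENE_A"])
print("sequence deduplication:", count_sequence_dedup(reads)["GENE_A"])
print("UMI deduplication:     ", count_umi_dedup(reads)["GENE_A"])
```

With exact UMI matching as assumed here, the three counts differ only in which reads are treated as the same original molecule.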
- 33:39Is everything clear? Do you hear me?
- 33:43I don't have interaction. Yes, yes.
- 33:47Yes, OK.
- 33:50OK, so that's why you use the cell
- 33:53barcode to identify the cell, you use the
- 33:55UMI to remove technical duplicates,
- 33:57and that's why in the count matrix,
- 34:00instead of the number
- 34:02of reads, you can find in single cell
- 34:04experiments the number of UMIs,
- 34:06because basically what you are
- 34:08doing is collapsing reads
- 34:10mapping to
- 34:11the same gene and with the same
- 34:14UMI. And after you do all these
- 34:18steps,
- 34:20you can arrive at your count
- 34:22matrix. In single cell
- 34:23it's called the digital expression matrix,
- 34:25because it represents the number of reads
- 34:29mapping to one gene in each of your cells.
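As a rough sketch of how such a matrix could be assembled, the snippet below groups aligned reads by cell barcode and gene and counts distinct UMIs. The barcodes, UMIs, gene names and the function name `digital_expression_matrix` are all hypothetical, and the input is assumed to be already aligned and assigned to genes.

```python
from collections import defaultdict

# Hypothetical aligned reads as (cell_barcode, umi, gene); all values invented.
aligned_reads = [
    ("CELL1", "AAAA", "GeneA"),
    ("CELL1", "AAAA", "GeneA"),  # same cell, gene and UMI: technical duplicate
    ("CELL1", "CCCC", "GeneA"),  # different UMI: a second original molecule
    ("CELL1", "GGGG", "GeneB"),
    ("CELL2", "AAAA", "GeneA"),
    ("CELL2", "TTTT", "GeneB"),
    ("CELL2", "TTTT", "GeneB"),  # technical duplicate
]


def digital_expression_matrix(aligned_reads):
    """Return (genes, cells, matrix) with matrix[i][j] = UMIs of gene i in cell j."""
    umis = defaultdict(set)
    for cell, umi, gene in aligned_reads:
        umis[(gene, cell)].add(umi)  # a set keeps each UMI only once
    genes = sorted({gene for gene, _cell in umis})
    cells = sorted({cell for _gene, cell in umis})
    matrix = [[len(umis.get((g, c), ())) for c in cells] for g in genes]
    return genes, cells, matrix


genes, cells, matrix = digital_expression_matrix(aligned_reads)
print("cells:", cells)
for gene, row in zip(genes, matrix):
    print(gene, row)
```

Rows are genes and columns are cells here; some tools use the transposed layout.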
- 34:32And the raw sequence data
- 34:34are, as always and as in bulk
- 34:36RNA-seq, in the FASTQ format.
- 34:38So that's how you receive your sequences.
- 34:41One of these steps,
- 34:42in order to quantify the expression,
- 34:44is that you align not the UMI and the
- 34:47barcode, because those are only technical,
- 34:50but
- 34:51you align the read corresponding
- 34:52to your cDNA.
- 34:54The alignment tools for single cell
- 34:56RNA-seq are almost all the
- 34:59same as the ones used for bulk
- 35:02RNA-seq.
- 35:02So here again,
- 35:04you see that there are multiple
- 35:06options and multiple alignment
- 35:07tools for different applications.
- 35:09So for example,
- 35:10here you see the years of publication.
- 35:13It's not really updated.
- 35:15The methods that you see in red
- 35:17are the ones that were developed
- 35:20specifically for RNA-seq,
- 35:21and STAR, which you see here, was
- 35:24released in 2012,
- 35:27so almost ten years ago. It is one of the
- 35:30standard alignment tools for bulk
- 35:32RNA-seq; we saw this with Everett,
- 35:35I think two weeks ago.
- 35:37It's also the most common tool,
- 35:40the default tool in many single
- 35:42cell pipelines.
- 35:43So almost all of them will use
- 35:46STAR or HISAT or another
- 35:48RNA-seq splice-aware
- 35:50aligner tool in order
- 35:51to perform the alignment,
- 35:53so this is not so different
- 35:56from bulk RNA-seq.
- 35:57Also,
- 35:58the alignment output that you
- 36:00receive will be a BAM file.
- 36:03This is a file where each original
- 36:05read carries the information
- 36:07on the alignment.
- 36:09This file contains the coordinates,
- 36:11so the chromosome and the genomic
- 36:13coordinate of the alignment,
- 36:14and you need to use this file
- 36:17in order to calculate the number
- 36:19of reads that map to each gene
- 36:21or each transcript.
- 36:23So this too is not different
- 36:25from bulk RNA-seq.
- 36:27The only difference is that, for
- 36:29example, you can have different
- 36:31BAM files for each cell
- 36:34instead of only one BAM file.
- 36:38OK, so what we covered so far is the
- 36:41first data preprocessing, so again we
- 36:43have cell barcode, UMI and the RNA.
- 36:46You cluster
- 36:48reads according to the cell, you simplify,
- 36:51you remove technical duplicates and you
- 36:53arrive at your gene expression matrix,
- 36:55your digital expression matrix.
- 36:56Now a big difference between the count
- 36:59data that you can obtain in bulk versus
- 37:01single cell is what you see here.
- 37:03So this is a typical count matrix from
- 37:06bulk RNA-seq, and you can see the numbers
- 37:09are very high and you rarely see zero values.
- 37:13In single cell RNA-seq,
- 37:15this is what you obtain most of the time.
- 37:18I would say this is a very good one,
- 37:20a very high quality one,
- 37:22an example with a low amount of zeros.
- 37:25So what you can see is that the numbers are
- 37:29lower and most of the values are zeros.
- 37:32So the fact that you have lower counts,
- 37:35it means that in all your analysis
- 37:37you will have a higher contribution
- 37:40of noise and this will bring you
- 37:42to a higher uncertainty in results.
- 37:45And this is a big problem;
- 37:48it means that when you choose
- 37:50among different pipelines,
- 37:51you will have very different results.
- 37:54But the origin of this is that
- 37:59your original values,
- 38:00your original quantification of expression
- 38:02values, were generally very low, and so
- 38:05the contribution of noise is higher.
- 38:07So this problem is one of the main
- 38:10problems in single cell RNA-seq and
- 38:12at the moment it is kind of unavoidable.
- 38:15I would say the second problem
- 38:18is that you have several zeros.
- 38:21And some of these zeros are
- 38:24real zeros, meaning that in that
- 38:26cell the gene is not expressed,
- 38:29so these correspond to true
- 38:31biological zeros; they represent
- 38:32the true lack of expression.
- 38:34But many times the zeros represent
- 38:36a technical lack of detection,
- 38:38meaning that the gene was
- 38:40present in your cell,
- 38:41but it was not detected because
- 38:44it was not captured by your bead,
- 38:47and so you don't have a way to see
- 38:49your gene because you didn't detect
- 38:52it in your library preparation.
- 38:54And obviously,
- 38:55in methods where you have a
- 38:57low coverage inside each cell,
- 39:00the probability of this technical
- 39:02lack of detection,
- 39:03which is also called the dropout
- 39:05effect, is very high.
- 39:07So I think in the 10X approaches
- 39:10you can expect your genes to be
- 39:13not detected with 80% probability.
- 39:16Obviously this depends also on
- 39:18whether the gene is highly expressed
- 39:20or has a low expression level.
- 39:22So if a gene has a high expression level,
- 39:25the probability of being detected at least
- 39:27with one molecule is higher,
- 39:29but genes with low expression levels,
- 39:31for example transcription factors,
- 39:32will rarely be detected,
- 39:34but most of the times it will be
- 39:36because of a detection problem,
- 39:38not because they are not expressed
- 39:40in the cell.
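The dependence of dropout on expression level can be illustrated with a very simple model. The assumption, which is not from the talk, is that each transcript copy of a gene is captured independently with the same probability, so the chance of detecting no molecule at all is (1 - capture efficiency) raised to the number of copies. The capture efficiency of 0.1 below is a hypothetical value chosen only for illustration.

```python
def dropout_probability(copies: int, capture_efficiency: float) -> float:
    """Probability that none of `copies` molecules is captured.

    Assumes independent capture of each molecule with the same
    probability, which is a simplification of real protocols.
    """
    if copies < 0 or not 0.0 <= capture_efficiency <= 1.0:
        raise ValueError("copies must be >= 0 and efficiency within [0, 1]")
    return (1.0 - capture_efficiency) ** copies


# Hypothetical capture efficiency, not a measured value for any platform.
ASSUMED_CAPTURE_EFFICIENCY = 0.1

for copies in (1, 10, 100, 1000):
    p = dropout_probability(copies, ASSUMED_CAPTURE_EFFICIENCY)
    print(f"{copies:>5} copies per cell: dropout probability {p:.3f}")
```

Under this model a gene present at one or a few copies per cell, as may be the case for a transcription factor, is missed most of the time, while a gene at thousands of copies is almost always detected.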
- 39:41And this too is an inherent problem
- 39:43with single cell data analysis.
- 39:45So it's very important to.
- 39:48Doodle
- 39:51uh no.
- 39:55No, yes, so now we wouldn't be
- 39:58ready by the end of the time.
- 40:01So one thing we could do is I could
- 40:04continue and finish the next time
- 40:09with the remainder of the analysis
- 40:11steps. Sure, yeah, I think, Tom,
- 40:12it's your own judgment
- 40:14how you want to proceed.
- 40:15Like, if you think it is a natural stop,
- 40:17then you can stop; if you think
- 40:19you want to cover 5 more minutes,
- 40:21go ahead and do that.
- 40:24I can give you a sort of preview
- 40:27of the following steps.
- 40:29So, many of these steps,
- 40:33I mean, for many of these steps I
- 40:36took inspiration from this review that
- 40:39was recently published in Nature Methods.
- 40:42So it covers the main triumphs,
- 40:44so the successes, and also the
- 40:47limitations of the computational
- 40:49methods for single cell RNA-seq
- 40:51analysis. And so next time we'll
- 40:54cover the key preprocessing steps.
- 40:56We have seen the molecular counting,
- 40:59but we will see
- 41:01how we can do quality control:
- 41:04remove cells that are suspicious,
- 41:06for example because they are dying or
- 41:09because they represent empty droplets
- 41:12or because they represent doublets.
- 41:14So doublets occur when you didn't really
- 41:17manage to physically separate the cells.
- 41:19So, for example, in the
- 41:22same droplet you have two cells,
- 41:24or for some reason the cell barcode
- 41:27of two different cells was shared
- 41:30because of some technical problem.
- 41:32So we'll see methods to remove
- 41:34dying cells and those doublets. Then there
- 41:37are problems related to the normalization.
- 41:40So how to take into account
- 41:42the fact that you have a
- 41:45different number of reads in
- 41:47different cells. And this is a problem
- 41:50because, biologically speaking, you
- 41:52expect cells of different types to
- 41:55have a different amount of RNA,
- 41:58so you expect some cells to have more
- 42:00RNA molecules than other cells,
- 42:03but most of the methods
- 42:05assume that you need to
- 42:08have from each cell the same number of
- 42:11reads or UMIs. And then we will see how.
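One common way to put cells on the same scale is library-size normalization, sketched below under stated assumptions: the matrix is a small genes-by-cells list of lists with made-up counts, the function name `normalize_per_cell` is hypothetical, and the scale factor of 10,000 is a conventional but arbitrary choice. Note that this is exactly the kind of method that forces every cell to the same total, so it removes real differences in RNA content between cell types along with the technical ones.

```python
def normalize_per_cell(matrix, scale=10_000):
    """Rescale each cell (column) so that its counts sum to `scale`.

    `matrix` is genes x cells. Cells with zero total counts are left as zeros.
    """
    n_cells = len(matrix[0])
    totals = [sum(row[j] for row in matrix) for j in range(n_cells)]
    return [
        [row[j] * scale / totals[j] if totals[j] else 0.0 for j in range(n_cells)]
        for row in matrix
    ]


# Hypothetical counts: the second cell has ten times more UMIs than the first
# but the same relative composition.
counts = [
    [2, 20],  # gene 1
    [8, 80],  # gene 2
]

for row in normalize_per_cell(counts, scale=100):
    print(row)
```

After normalization the two cells in this toy example look identical, which is appropriate if the tenfold difference was technical and misleading if it was biological.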
- 42:14So, how to remove genes that are
- 42:17not important in the analysis.
- 42:19This is very important because, as
- 42:20you can imagine, in single cell you
- 42:23have thousands of genes and also
- 42:25thousands of cells.
- 42:26So your count matrix is highly
- 42:29dimensional, and so all of these methods
- 42:31try to reduce the number of cells,
- 42:33keeping only the high quality
- 42:35cells, but also the number of genes,
- 42:38so reducing the dimensionality
- 42:39of your data.
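A minimal sketch of this kind of filtering is shown below. The thresholds (`min_counts_per_cell`, `min_cells_per_gene`), the function name and the toy counts are assumptions made for illustration; real pipelines choose thresholds from the data and use additional criteria such as mitochondrial content or doublet scores.

```python
def filter_matrix(matrix, min_counts_per_cell=3, min_cells_per_gene=2):
    """Drop low-count cells, then genes detected in too few remaining cells.

    `matrix` is genes x cells. Returns (kept_gene_idx, kept_cell_idx, filtered).
    """
    n_cells = len(matrix[0])
    keep_cells = [
        j for j in range(n_cells)
        if sum(row[j] for row in matrix) >= min_counts_per_cell
    ]
    by_cell = [[row[j] for j in keep_cells] for row in matrix]
    keep_genes = [
        i for i, row in enumerate(by_cell)
        if sum(1 for value in row if value > 0) >= min_cells_per_gene
    ]
    return keep_genes, keep_cells, [by_cell[i] for i in keep_genes]


# Hypothetical genes x cells counts; the second cell is empty.
counts = [
    [5, 0, 3, 0],
    [1, 0, 2, 1],
    [0, 0, 0, 4],
]

kept_genes, kept_cells, filtered = filter_matrix(counts)
print("kept genes:", kept_genes)
print("kept cells:", kept_cells)
for row in filtered:
    print(row)
```

The order of the two steps matters: genes are evaluated only on the cells that passed the cell filter.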
- 42:46Leash.
- 42:48That sounds amazing. I look forward to
- 42:51next week. Yeah. And also next week
- 42:53we will see the downstream analysis,
- 42:55for example clustering for single
- 42:58cell approaches, and
- 43:01possibly also trajectory estimation.
- 43:05Yeah, thanks so much, and one thing I really
- 43:07want to comment is that your slides
- 43:09look very beautiful, and I wish all my
- 43:12slides, and possibly other people in my lab's
- 43:14as well, looked as nice. When
- 43:17I try to put a course
- 43:19on this, so it's kind of,
- 43:21maybe we should use your slides as a template.