
Analysis and Interpretation of single cells sequencing data – part 1 Introduction and alignment

August 25, 2021
ID
6875

Transcript

  • 00:00Because now the main focus of today
  • 00:02will be the analysis and interpretation
  • 00:05of single cell sequencing data.
  • 00:07So we won't cover everything today,
  • 00:09and it will take at least
  • 00:14another meeting to cover everything.
  • 00:16But today we cover the
  • 00:18introduction to the methodologies,
  • 00:20some technical and experimental issues,
  • 00:22and some issues also with
  • 00:24the analysis of this data.
  • 00:27So single cell analysis, as a
  • 00:29definition, is the study of omics,
  • 00:31at least that's what we're speaking about
  • 00:34today: the study of omics, so
  • 00:37genomics, transcriptomics, proteomics,
  • 00:39at the single cell level.
  • 00:40So the advantage is that this
  • 00:43family of methods allows you to capture
  • 00:46the cellular diversity of tissues
  • 00:49with single cell resolution.
  • 00:52So the field is bursting, it is
  • 00:55like exploding, with a number
  • 00:59of novel experimental techniques every year.
  • 01:02But there are also many
  • 01:04computational challenges,
  • 01:05so these methods,
  • 01:06the single cell methods, require the
  • 01:09development of appropriate analyses.
  • 01:10And so we will see that
  • 01:14common workflows employ
  • 01:16some generic methods,
  • 01:17for example the clustering analysis that
  • 01:20June spoke about in our first meeting.
  • 01:23Some of the methods for the normalization,
  • 01:26for example,
  • 01:27or for the calculation of differential gene
  • 01:31expression, are taken from bulk RNA-seq,
  • 01:34but it is not always the best choice.
  • 01:38And since the field is rapidly moving,
  • 01:41there is no gold standard, I would say,
  • 01:45in any step of the analysis,
  • 01:47so you will find a lot of methods,
  • 01:50a lot of applications.
  • 01:52You can find literature comparing,
  • 01:54for each step of the analysis,
  • 01:56alternative approaches,
  • 01:57but there is no gold reference
  • 02:00that you can choose.
  • 02:02For example,
  • 02:03there are sort of gold-standard pipelines in
  • 02:06bulk RNA-seq, and in single cell
  • 02:09it's not so established.
  • 02:13This is a comparison of the
  • 02:15methods, single cell versus bulk.
  • 02:16So in the bulk analysis you take a tissue
  • 02:19or population of cells and you extract
  • 02:21RNA from the whole population, so that
  • 02:24you mix up the RNA content in the same,
  • 02:29yeah, in the same container, let's say,
  • 02:31and then you prepare the library
  • 02:34and you sequence RNA from
  • 02:37the whole population of cells.
  • 02:40This means that you get, for each library, from
  • 02:43each collection of cells, from each tissue,
  • 02:46only one measurement.
  • 02:47And this measurement of genes represents
  • 02:50the average expression of these genes
  • 02:53across all the cells of your tissue.
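To make the averaging point concrete, here is a minimal sketch with invented numbers (the cell labels and expression values are hypothetical, not data from the talk). It shows how a single bulk value can hide the fact that only one cell type expresses the gene.

```python
# Toy expression values for one gene in six cells (hypothetical numbers).
cell_types = ["A", "A", "A", "B", "B", "B"]
expression = [10.0, 12.0, 11.0, 0.0, 0.0, 0.0]


def bulk_measurement(values):
    """Bulk RNA-seq analogue: one averaged value for the whole population."""
    return sum(values) / len(values)


def per_type_mean(types, values):
    """Single cell analogue: keep the cell of origin and average within each type."""
    grouped = {}
    for cell_type, value in zip(types, values):
        grouped.setdefault(cell_type, []).append(value)
    return {t: sum(v) / len(v) for t, v in grouped.items()}


bulk = bulk_measurement(expression)
by_type = per_type_mean(cell_types, expression)
print(bulk)     # one number for the whole tissue
print(by_type)  # one number per cell type
```

The bulk value sits between the two populations and matches neither of them, which is the heterogeneity that only the per-cell measurement can reveal.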
  • 02:56So obviously you cannot use bulk
  • 02:58RNA-seq if you want, for example,
  • 03:01to see the cellular heterogeneity in your
  • 03:04tissue. With single cell analysis, yeah,
  • 03:07you first perform a step that
  • 03:10is the isolation of the cells.
  • 03:12So this is kind of tricky,
  • 03:14especially in solid tissues, because you
  • 03:17need to mechanically separate each cell.
  • 03:19It's easier with liquid tissues.
  • 03:22It's easiest, for example,
  • 03:23when you consider the analysis of
  • 03:26hematopoietic cells. And so inside each
  • 03:28single cell you perform the
  • 03:30quantification of gene expression,
  • 03:32because you have a way to create a
  • 03:35library where you can keep track
  • 03:38of the cell of origin of each RNA,
  • 03:41and so that's why
  • 03:42you can then quantify, for each gene,
  • 03:45the expression in each single cell,
  • 03:48so that each cell has a distinct
  • 03:51expression profile.
  • 03:51For example,
  • 03:52this cell expresses only one gene.
  • 03:54These other cells express
  • 03:57multiple different genes, and in
  • 03:59different amounts, so that you
  • 04:01can use
  • 04:02this difference in the expression
  • 04:05between different cells in order
  • 04:08to see how similar cells are
  • 04:10to each other, or how different.
  • 04:12So for example,
  • 04:13you can perform clustering analysis of
  • 04:16cells based on their expression profiles
  • 04:18and also other downstream analyses.
  • 04:21So obviously you have richer data
  • 04:25that you can see,
  • 04:28and you have many more options
  • 04:32in the final analysis.
  • 04:36So yes,
  • 04:36when it was launched there
  • 04:39was this kind of comparison with
  • 04:41bulk RNA-seq: that bulk
  • 04:43analysis is like the analysis of a fruit
  • 04:46smoothie, and single cell
  • 04:48is like the analysis of a fruit
  • 04:51salad, where you can distinguish the
  • 04:53contribution of each fruit, where each
  • 04:56fruit is a different cell type or subtype.
  • 04:59Now the main applications of single
  • 05:01cell RNA sequencing, when we're speaking
  • 05:04about discrimination among different cells:
  • 05:07there are multiple. I divided
  • 05:09these into two branches.
  • 05:11So one is the so-called
  • 05:14discrete analysis.
  • 05:15So you
  • 05:17collect the expression,
  • 05:18the abundance of transcripts of genes
  • 05:21inside each cell, and you want to
  • 05:23cluster cells in order to identify
  • 05:25different cell types,
  • 05:26for example,
  • 05:27the cell types that compose the
  • 05:29tissue that you're studying.
  • 05:31So this is a discrete analysis because
  • 05:34you are assuming that your tissue is
  • 05:36composed of different types of cells
  • 05:39that are clearly
  • 05:40distinguishable from each other,
  • 05:42and so this analysis has,
  • 05:44for example, something to do
  • 05:46with clustering,
  • 05:49because ultimately you want to
  • 05:51identify separate clusters of
  • 05:53cells based on their expression
  • 05:55profile. On the right... Question.
  • 05:57A question: this is super
  • 05:59relevant to hematopoiesis,
  • 06:01so there's even controversy,
  • 06:03I don't know that it should
  • 06:05be a controversy, about whether
  • 06:08there are discrete cell states
  • 06:10versus everything being continuous,
  • 06:12and logic tells me that there's
  • 06:15going to be a continuous change
  • 06:18in a bazillion different genes,
  • 06:20because every cell is going
  • 06:22to be slightly different.
  • 06:24So do you have to actually ask
  • 06:27the algorithm to analyze the
  • 06:29data to find discrete sets versus
  • 06:32find a continuous analysis?
  • 06:36So I personally don't know if there is a way,
  • 06:41if there is a tool, so I never used a
  • 06:44tool that tells you if the best analysis
  • 06:47is discrete or continuous. OK, I think
  • 06:50that probably if we looked at different
  • 06:52papers where they claim it's discrete
  • 06:54versus claiming it's continuous
  • 06:56that we would find differences in
  • 06:58how they analyzed it. Yes,
  • 07:00so the prior knowledge of the
  • 07:02sample is something that you can use,
  • 07:05and for example if you take,
  • 07:07for example, peripheral blood,
  • 07:09if you take single cell data sets
  • 07:11of peripheral blood, where
  • 07:13most of the cells are mature
  • 07:15and already differentiated,
  • 07:16then you see clearly that you have
  • 07:19very separated discrete clusters,
  • 07:21and so it makes more sense to perform
  • 07:25a discrete analysis or clustering
  • 07:27analysis. If you take bone marrow
  • 07:30or a population that is enriched
  • 07:32for stem cells or progenitors,
  • 07:34then you expect to have a more continuous
  • 07:39representation of your sample.
  • 07:41And so this is important because
  • 07:43whatever tool you use,
  • 07:46any clustering analysis will
  • 07:47give you clusters, and any
  • 07:50continuous analysis, such as
  • 07:52inference of trajectories,
  • 07:53will find a trajectory.
  • 07:55So if you submit your sample to any analysis,
  • 07:59you will obtain a result,
  • 08:01but the result can be meaningless.
  • 08:04For example, a continuous analysis
  • 08:06can be meaningless if
  • 08:09your sample is biologically not,
  • 08:11um,
  • 08:12for example,
  • 08:13something that is differentiating
  • 08:15or developing.
  • 08:19So yeah, yeah, so yeah.
  • 08:22And that's that's the parallel.
  • 08:24So whenever you see, like,
  • 08:29courses or tutorials on continuous analysis,
  • 08:31that's something you need to be careful about.
  • 08:33You will always get a graph.
  • 08:35You will always get a sort
  • 08:38of differentiation tree,
  • 08:39but you have to be careful because
  • 08:42sometimes it doesn't make sense
  • 08:46to do the analysis at all.
  • 08:49Because
  • 08:50one of the assumptions of a
  • 08:52continuous analysis is that you
  • 08:54have a sampling of the continuous
  • 08:56process that you're trying to model,
  • 08:58for example, development or differentiation.
  • 09:00If this is not true,
  • 09:01you don't have a basis
  • 09:03for doing the analysis at all.
  • 09:07So, Tommaso, I think this
  • 09:10is a very important question that
  • 09:12was raised, because you know in real
  • 09:15life situations we will get samples
  • 09:17sequenced, and how do we tell if
  • 09:20this is reasonable or not reasonable?
  • 09:22So I just wonder if anyone has done
  • 09:25a very careful analysis with sort of,
  • 09:27you know, some ground truth. For example,
  • 09:30you have two discrete cell states you
  • 09:32already isolated or somehow maintained, and
  • 09:34you put them into single cell sequencing,
  • 09:37and then you force it to assume a
  • 09:40trajectory-based methodology: does it
  • 09:41cause many major artifacts or not?
  • 09:43I think that's one of the ways
  • 09:46to think about it.
  • 09:48Yeah, so I'm not aware, meaning
  • 09:51that I never used tools that
  • 09:55explicitly tell you
  • 09:58which branch of the analysis is better.
  • 10:01So by exploratory analysis,
  • 10:02for example, we will see that when
  • 10:06you do the preprocessing and then the
  • 10:09dimensionality reduction and you have
  • 10:12a plot, like a plane of cells,
  • 10:16you can guess,
  • 10:18depending on the structure of your sample,
  • 10:21whether it's more reasonable to
  • 10:23proceed with discrete clustering or
  • 10:25to perform a trajectory analysis, or both.
  • 10:27So sometimes, for example,
  • 10:30it could make sense to start with an
  • 10:33exploratory analysis on all the cells.
  • 10:36I take this as an example because
  • 10:38it's on this slide.
  • 10:40So this seems to be a separate
  • 10:43cluster of cells.
  • 10:45It could be reasonable,
  • 10:46then, to select only this
  • 10:48cluster. Within this cluster
  • 10:50we don't see clear subclusters,
  • 10:51so within this cluster
  • 10:54it could make sense to perform
  • 10:56a trajectory analysis to see if
  • 10:58there is a continuous process,
  • 11:00but not at the beginning,
  • 11:02taking into consideration
  • 11:03also these two clusters here,
  • 11:05because they're clearly separated.
  • 11:07So sometimes I think
  • 11:09the workflow can also be mixed.
  • 11:12So you start with all the cells,
  • 11:14you remove clear outlier clusters,
  • 11:16maybe you annotate the clusters
  • 11:18so that you know,
  • 11:20for example,
  • 11:20that inside your population you have a
  • 11:23mixture of progenitors or stem cells,
  • 11:25and inside that cluster
  • 11:28you perform the trajectory analysis.
  • 11:33I see. So that would be
  • 11:35my, yes, tentative answer.
  • 11:36Now I don't know if anyone
  • 11:38else has other suggestions.
  • 11:46Then maybe we can leave it.
  • 11:48I can find some material
  • 11:49for next time to see if I can
  • 11:52answer more extensively.
  • 11:55I think it's a tough question
  • 11:56because I don't think there's
  • 11:57a consensus necessarily in the
  • 11:59field, and people just see,
  • 12:00OK, if it makes sense or it doesn't
  • 12:02make sense to their own eyes.
  • 12:05Yep. Yeah, again, I don't know
  • 12:07if someone is trying to build
  • 12:09some tools that, yeah, you know,
  • 12:12that kind of quantify the
  • 12:16reasonableness of each
  • 12:18of the approaches. Yeah.
  • 12:23But yes, it's an important distinction,
  • 12:26also, yes, also for later.
  • 12:30Um, some history.
  • 12:32So this is the first publication on
  • 12:34single cell sequencing, so it's 20,
  • 12:37no, sorry, 12 years ago.
  • 12:39So it was an RNA-seq covering
  • 12:42the whole transcriptome of a single
  • 12:45cell, so it was really one single cell,
  • 12:48because it was a mouse blastomere that
  • 12:51was isolated with a microscope, so it
  • 12:54was manually picked under
  • 12:56a microscope, then lysed and then
  • 12:58sequenced, and together with the blastomere
  • 13:01oocytes were also analyzed.
  • 13:03And so basically the trick here
  • 13:06to reach single cell resolution
  • 13:09was the isolation of these cells,
  • 13:11and then the procedure was
  • 13:14standard lysis and then library
  • 13:16preparation as in bulk RNA-seq,
  • 13:19but starting from one cell.
  • 13:23So from that the field,
  • 13:26as I told you,
  • 13:27exploded, and so in this plot here,
  • 13:31so this is from a review from 2018,
  • 13:34so it was ten years after this
  • 13:37first publication.
  • 13:38And what you can see is the
  • 13:40release of multiple approaches
  • 13:42for single cell RNA-seq
  • 13:44that increase the number
  • 13:46of cells that you can study.
  • 13:48So obviously that one alone
  • 13:50was a proof of concept,
  • 13:52but the real
  • 13:54single cell explosion happened
  • 13:55when you were
  • 13:58able to parallelize the process,
  • 13:59so when you were able to capture
  • 14:01the single cell expression level of
  • 14:03first hundreds and then thousands
  • 14:05and then millions of cells.
  • 14:07So here you see the publication date of
  • 14:10the techniques and
  • 14:12the number of single cells
  • 14:14that were analyzed.
  • 14:15So this is the first, with only one
  • 14:17cell, and then you see that the trend
  • 14:20is to release techniques that allow
  • 14:22you to increase the throughput
  • 14:24in terms of the number of
  • 14:27cells that you can quantify,
  • 14:29that you can consider in each experiment.
  • 14:33Question. Yeah, I don't know if
  • 14:35you're going to get to this,
  • 14:36so if you are, just say never mind.
  • 14:40As the number of cells that are
  • 14:42being sequenced has increased,
  • 14:44the number of reads per cell that
  • 14:46people get and report on has decreased,
  • 14:49and I'd like to understand: is
  • 14:51that just because that's what's
  • 14:53convenient in terms of putting
  • 14:54it onto an Illumina sequencer,
  • 14:56or is there something about the various
  • 14:59techniques where you reach the limit
  • 15:01of your detection after X number of
  • 15:04reads and it's not worth getting more?
  • 15:08Yes, so there is a tradeoff
  • 15:10between these two parameters.
  • 15:11One is the number of cells that you consider
  • 15:14and the other is the number of reads
  • 15:17that you obtain for each cell.
  • 15:20The trend for the techniques has been
  • 15:23mainly to increase the number of cells,
  • 15:27and obviously this was against,
  • 15:29this is against the number
  • 15:31of reads for each cell.
  • 15:34So for example, and,
  • 15:38so let's say that the field and
  • 15:40the most popular techniques,
  • 15:42for example 10X, have been
  • 15:44increasing the number of cells more
  • 15:46than the number of reads for each cell.
  • 15:50This depends on the
  • 15:52application of the method, I guess.
  • 15:54So obviously if you're interested in
  • 15:57the cell as your unit of interest,
  • 15:59so if you're interested in
  • 16:01more cellular biology,
  • 16:03you're interested more in capturing
  • 16:05cells and separating cells,
  • 16:06you're not so interested in looking
  • 16:08in great detail at the thousands
  • 16:10of genes that are expressed within
  • 16:13each cell. On the other side,
  • 16:15if you're more interested,
  • 16:16for example, in the molecular
  • 16:18biology rather than
  • 16:20just separating cells, it would
  • 16:22be more interesting to increase the
  • 16:25depth of the sequencing in each cell.
  • 16:29There are techniques where this is
  • 16:32maximized, and obviously the tradeoff
  • 16:35is that you cannot get as many cells
  • 16:38as with the other methods.
  • 16:42Um?
  • 16:45I think I'm
  • 16:47just asking also, and maybe June
  • 16:49knows: is there a maximum number
  • 16:50of reads you want per cell?
  • 16:53Because after that you don't
  • 16:54get any additional information.
  • 16:57So we will see that we can measure when
  • 17:00you reach the plateau,
  • 17:03when you reach the plateau of the
  • 17:07sequencing, using tricks such as the UMI.
  • 17:10So if you append to each
  • 17:12read a random barcode,
  • 17:14you can see when it doesn't make sense
  • 17:17to sequence deeper, because all the
  • 17:20additional reads that you are detecting
  • 17:22are PCR duplicates of what
  • 17:25you already sequenced. Right,
  • 17:27got it. Yeah, I
  • 17:28agree with Tommaso.
  • 17:29So I think there are some studies that have
  • 17:32done those kinds of things before,
  • 17:33but they were using different
  • 17:35technologies compared to what you're
  • 17:37going to use probably right now,
  • 17:38so I'm not sure whether for every
  • 17:40single technology out there
  • 17:42there has already been a paper published.
  • 17:44Maybe for 10X there's already a paper
  • 17:45published on the standard procedures,
  • 17:47but in your own data you can actually
  • 17:49analyze it yourself to see whether
  • 17:51it's approaching saturation or not,
  • 17:53and you can sequence more from the
  • 17:55same library if you want to.
  • 17:57Yeah. Yes, yes, using UMIs
  • 18:00you can measure that.
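The saturation check described here can be sketched in a few lines. This is a minimal illustration, not the method of any specific pipeline: it assumes each read has already been reduced to a hypothetical (cell barcode, gene, UMI) triple, and it uses one common definition of saturation, the fraction of reads that repeat an already seen triple.

```python
def sequencing_saturation(reads):
    """Fraction of reads that are duplicates of an already seen
    (cell barcode, gene, UMI) combination.

    Values near 0 mean most reads are new molecules; values near 1 mean
    additional sequencing mostly re-reads PCR duplicates.
    """
    if not reads:
        return 0.0
    return 1.0 - len(set(reads)) / len(reads)


# Hypothetical reads, in the order they came off the sequencer.
example_reads = [
    ("c1", "g1", "AAA"),
    ("c1", "g1", "AAA"),  # same molecule read again: PCR duplicate
    ("c1", "g1", "CCC"),  # same gene, different UMI: a second molecule
    ("c1", "g2", "AAA"),  # same UMI on another gene: a different molecule
]

# Saturation at increasing depth: a curve that flattens near 1 suggests
# that sequencing deeper adds little new information.
curve = [sequencing_saturation(example_reads[:n]) for n in range(1, 5)]
print(curve)
```

On real data the prefixes would normally be random subsamples of the reads rather than the first n reads; the prefix is used here only to keep the sketch deterministic.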
  • 18:02OK, I understand, thank you. Then
  • 18:05the same techniques,
  • 18:06for example 10X, at every
  • 18:09release
  • 18:13increase the detection of multiple
  • 18:15molecules inside each cell, so that
  • 18:17the saturation limit is higher, and so
  • 18:19it really depends also
  • 18:22on the technique and on the
  • 18:24version of the technique itself.
  • 18:28But in general, I would say that
  • 18:30the number of cells that you can
  • 18:33measure has increased more,
  • 18:35on average across the techniques,
  • 18:37than the depth,
  • 18:40than the within-cell depth.
  • 18:42With an exception that I
  • 18:44show you in this slide,
  • 18:47that is the Smart-seq
  • 18:49family of techniques.
  • 18:50So this family of techniques is
  • 18:53the ideal family when you are not
  • 18:56interested in capturing a lot of cells,
  • 18:59but you want to maximize the analysis
  • 19:02within each cell, and the advantage
  • 19:05of these techniques is that
  • 19:08you can have 1,000,000 reads for each cell.
  • 19:11So it has high coverage, and also
  • 19:14it's one of the techniques that allow you
  • 19:18to capture reads from the whole transcript.
  • 19:21So we will see that the majority
  • 19:23of commercial techniques,
  • 19:25such as 10X,
  • 19:28do not allow you to cover
  • 19:30the full transcript,
  • 19:31but they are three prime
  • 19:32end or five prime
  • 19:34end libraries.
  • 19:35So that means that you can capture
  • 19:37only the fragment that is near the
  • 19:39poly-A for the three prime end, or near
  • 19:41the cap for the five prime end.
  • 19:43This is one of the few methods where
  • 19:46you can, as in bulk, as in most bulk
  • 19:48RNA-seq, capture reads from
  • 19:50the full transcript, and this is an
  • 19:52advantage because, for example, if
  • 19:54you want to do splicing analysis,
  • 19:56that's the only way you can,
  • 19:58you know, that's the
  • 20:01only method you can use
  • 20:04to perform a full
  • 20:06splicing analysis. Otherwise,
  • 20:07you can perform splicing analysis
  • 20:09only on the initial exons,
  • 20:10at the five prime end, or on the terminal
  • 20:13exons. And also for allelic analysis,
  • 20:15so analysis of variations, analysis of
  • 20:17SNPs from RNA-seq,
  • 20:20if your mutation of interest,
  • 20:22if your variation of interest, is
  • 20:25inside the body of the gene and not
  • 20:28at the five prime or the three prime end.
  • 20:31So this has all the advantages of
  • 20:34allowing an analysis within each cell
  • 20:37that is comparable to bulk
  • 20:39RNA-seq.
  • 20:42It has a limitation that is shared
  • 20:44with other techniques,
  • 20:46which is that most of the single cell
  • 20:48techniques right now allow you to
  • 20:51detect only polyadenylated RNA,
  • 20:53because they're based on poly-A selection.
  • 20:56And you have a low number of cells that
  • 20:58you can sequence in each experiment,
  • 21:01so less than 1000.
  • 21:02Then Smart-seq already has three versions,
  • 21:05so it was released first in 2012.
  • 21:07Then there is Smart-seq2, released
  • 21:09one year after, and then the
  • 21:12latest is Smart-seq3, which was
  • 21:14released last year.
  • 21:16So each of these kind of increased
  • 21:19the number of usable
  • 21:22reads. Smart-seq and
  • 21:23Smart-seq2 didn't allow you to use
  • 21:26UMIs, but Smart-seq3 also allows you to use
  • 21:30UMIs, and this is a comparison
  • 21:32between the two versions,
  • 21:34Smart-seq2 and Smart-seq3, where you
  • 21:36see the box plot with the number of
  • 21:40genes that are detected within each
  • 21:42cell, and as you see, with Smart-seq3
  • 21:45you can, for each cell,
  • 21:47detect from 10,000 to 12,000 genes.
  • 21:50And also this number is comparable
  • 21:52to bulk RNA-seq. If you compare
  • 21:55this number with the other methods,
  • 21:57such as 10X, in 10X
  • 21:59I think the average is 3,000 to 5,000 genes for
  • 22:03each cell, when this value is high.
  • 22:11OK, so this is an example of high coverage
  • 22:14but low throughput.
  • 22:15On the other hand, you have methods
  • 22:18where you have low coverage inside
  • 22:20each cell but high throughput, and
  • 22:23a family of these methods is
  • 22:25the so-called droplet-based methods.
  • 22:27This was one of the first
  • 22:31that was released, and it is
  • 22:35the Drop-seq analysis.
  • 22:37So the principle is to isolate
  • 22:42single cells in single droplets,
  • 22:44where you have your cell and you have
  • 22:47barcoded beads. The barcoded beads
  • 22:49allow you to attach a cellular barcode
  • 22:53that is unique for each bead,
  • 22:56and so it's unique for each cell.
  • 22:59And so that's the trick that is
  • 23:02used in order to
  • 23:04associate the content of each
  • 23:06cell with a single barcode,
  • 23:08that is the cell barcode.
  • 23:12So it allows you to map thousands or tens of
  • 23:16thousands of cells in the same experiment.
  • 23:19So Drop-seq is
  • 23:21only three prime end sequencing,
  • 23:24and it allows the use of
  • 23:26unique molecular identifiers (UMIs).
  • 23:28So I will have some slides later to
  • 23:31show what is the meaning of that.
  • 23:34This is the pipeline,
  • 23:36the experimental pipeline
  • 23:37of a Drop-seq experiment.
  • 23:40So the principle is to isolate,
  • 23:43at some point, in a droplet,
  • 23:45one cell with one microparticle.
  • 23:47Inside this droplet you have the capture of
  • 23:50the polyadenylated RNA with a poly-T probe,
  • 23:53and then you have
  • 23:56the reverse transcription, the generation
  • 23:58of the cDNA, and the library preparation.
  • 24:01This is kind of similar
  • 24:04also to bulk RNA-seq approaches.
  • 24:06Very similar to Drop-seq
  • 24:08is also the 10X approach,
  • 24:10that is the commercial
  • 24:12development of Drop-seq.
  • 24:14So in 10X
  • 24:16you have the same strategy of
  • 24:18dividing cells so
  • 24:20that you have droplets in oil,
  • 24:22in this case with a single cell and a
  • 24:25single bead with the cellular barcode.
  • 24:28And as you can see, the barcodes attached
  • 24:31to each bead have a standard adapter that
  • 24:33you can use in Illumina sequencing.
  • 24:36You have the cellular barcode,
  • 24:38you have the UMI,
  • 24:40and then you have a poly-T probe
  • 24:42that is used to capture poly-A RNA.
  • 24:47Um, this is to remind you
  • 24:49that different platforms,
  • 24:50according to different strategies, have
  • 24:51different gene coverages. So
  • 24:53Smart-seq2, that we saw before,
  • 24:55has full coverage.
  • 24:57So if you consider this like a metagene,
  • 25:00we have the five prime UTR,
  • 25:02the body of the gene,
  • 25:03the coding sequence, and the three prime UTR.
  • 25:06You have coverage of the full transcript,
  • 25:08while with 10X or a three prime
  • 25:10method you have enrichment only
  • 25:12at the three prime end of the
  • 25:14transcript. With the five prime method
  • 25:16you have enrichment only
  • 25:17at the five prime end,
  • 25:19so you need to be careful about
  • 25:21which library you are using,
  • 25:22because if it is just
  • 25:25for gene quantification
  • 25:27the methods can be comparable,
  • 25:28but if you are interested,
  • 25:30for example, in isoform
  • 25:32expression, splicing
  • 25:33analysis and so on, only
  • 25:35these methods allow you to
  • 25:37perform a complete analysis,
  • 25:38not these ones.
  • 25:42And this is another plot
  • 25:43comparing the gene coverage
  • 25:45when you have full coverage.
  • 25:46So this plot here is similar
  • 25:48to plots that you could obtain
  • 25:50from bulk RNA-seq.
  • 25:52This method is three prime end,
  • 25:54this is a three prime
  • 25:56method, and so you see the
  • 25:58enrichment at the three
  • 25:59prime end of the transcript.
  • 26:04Now this was mainly for the technical part.
  • 26:08Now the outlook on the computational
  • 26:11analysis of single cell is
  • 26:13summarized by this workflow.
  • 26:15So most of the most popular methods
  • 26:18allow you to generate libraries,
  • 26:20and the result will be reads
  • 26:23sequenced with a standard platform
  • 26:25such as Illumina. Some single cell
  • 26:28methods have also been published
  • 26:30that use full-length sequencing,
  • 26:32so I think that probably
  • 26:35in the future they will be
  • 26:39used more,
  • 26:40but right now the standard is
  • 26:42to use short-read sequencing
  • 26:44coupled to single cell analysis. So
  • 26:46we will see how raw data are obtained
  • 26:49and how the raw reads can be
  • 26:52transformed into count matrices
  • 26:54that are similar to the count
  • 26:56matrices of bulk RNA-seq.
  • 26:59But instead of having samples
  • 27:02and genes you have single cells and
  • 27:04genes in your matrix, and the numbers
  • 27:07correspond to the number of reads
  • 27:09mapping to the gene in the cell.
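As a rough picture of what such a matrix looks like, here is a minimal sketch that builds a genes by cells count matrix from reads that have already been assigned to a cell and a gene. The input format, the function name and the gene and cell labels are assumptions made for illustration, not the output of a particular aligner or pipeline.

```python
from collections import Counter


def build_count_matrix(assignments):
    """Build a dense genes x cells matrix from (cell, gene) read assignments.

    Returns the sorted gene names (rows), the sorted cell names (columns)
    and the matrix as a list of rows.
    """
    counts = Counter(assignments)
    cells = sorted({cell for cell, _ in assignments})
    genes = sorted({gene for _, gene in assignments})
    # Counter returns 0 for (cell, gene) pairs that were never observed.
    matrix = [[counts[(cell, gene)] for cell in cells] for gene in genes]
    return genes, cells, matrix


# Hypothetical assignments: one tuple per read.
example_assignments = [
    ("cell1", "GAPDH"),
    ("cell1", "GAPDH"),
    ("cell2", "GAPDH"),
    ("cell2", "ACTB"),
]

genes, cells, matrix = build_count_matrix(example_assignments)
for gene, row in zip(genes, matrix):
    print(gene, dict(zip(cells, row)))
```

Real single cell matrices are mostly zeros, so in practice they are usually stored in a sparse format; the dense list of lists is used here only for readability.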
  • 27:12Then there are quality control methodologies,
  • 27:14visualization methodologies, clustering,
  • 27:17identification of trajectories,
  • 27:19so analyses that assume your
  • 27:23sample, your population of cells,
  • 27:26is continuous, and methods that assume
  • 27:28that your population is discrete.
  • 27:31So let's start from the beginning.
  • 27:33So usually, in most of the methods,
  • 27:35in the raw reads that you receive
  • 27:38there are three important parts
  • 27:39that you have,
  • 27:40three parts of the sequence.
  • 27:43And so the first important part
  • 27:45is the cell barcode.
  • 27:47So this is an oligonucleotide that
  • 27:50can be 8 to 12 or more nucleotides long.
  • 27:53This depends on the technique,
  • 27:55and this sequence, the
  • 27:58cell barcode, is unique for each of the beads,
  • 28:01for example, that you used,
  • 28:04that was with your cell
  • 28:05in the droplet,
  • 28:07so it's what you use to identify the cell,
  • 28:10meaning that one of the first steps
  • 28:12is to look at the region of the
  • 28:14read that corresponds to the cell
  • 28:16barcode and to group together
  • 28:18all the reads that have the same
  • 28:20barcode, and that's what you see here.
  • 28:22So all the reads with the same barcode,
  • 28:25here in red, belong to cell one,
  • 28:27because this is the cell barcode
  • 28:29of this cell, and so on,
  • 28:31so that all reads are grouped
  • 28:33according to the value
  • 28:34of the barcode.
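The grouping step can be sketched as follows. The read layout is an assumption made for illustration: the barcode read is taken to start with the cell barcode, immediately followed by the UMI, and the lengths used here (4 and 3 nucleotides) are toy values, much shorter than those of real protocols.

```python
from collections import defaultdict

# Hypothetical layout of the barcode read: [cell barcode][UMI]...
CELL_BARCODE_LEN = 4
UMI_LEN = 3


def split_barcode_read(sequence):
    """Split a barcode read into its cell barcode and UMI parts."""
    cell_barcode = sequence[:CELL_BARCODE_LEN]
    umi = sequence[CELL_BARCODE_LEN:CELL_BARCODE_LEN + UMI_LEN]
    return cell_barcode, umi


def group_reads_by_cell(barcode_reads):
    """Group the UMIs of all reads that share the same cell barcode."""
    groups = defaultdict(list)
    for sequence in barcode_reads:
        cell_barcode, umi = split_barcode_read(sequence)
        groups[cell_barcode].append(umi)
    return dict(groups)


# Three toy reads: two from one cell, one from another.
example_barcode_reads = ["AAAACGT", "AAAATTT", "CCCCGGG"]
grouped = group_reads_by_cell(example_barcode_reads)
print(grouped)
```

In a real library the lengths and positions of the barcode and UMI are fixed by the chemistry, so the two constants would be set from the protocol documentation rather than chosen freely.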
  • 28:36So obviously here there are some
  • 28:39methodologies to account for possible errors
  • 28:41in the sequencing of the barcode, so that
  • 28:45barcodes are designed in a
  • 28:47way that they differ at multiple
  • 28:49nucleotides,
  • 28:50so that if you make only one error
  • 28:54you don't,
  • 28:54you don't switch from one cell to another,
  • 28:58but you need at least,
  • 29:00for example,
  • 29:01three errors in the sequencing
  • 29:03of the barcode to identify
  • 29:05the wrong cell for the read.
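One simple way to exploit this design is to compare each observed barcode with the list of known barcodes and accept it only if exactly one known barcode lies within a small Hamming distance. The sketch below assumes a hypothetical whitelist whose barcodes differ from each other at three or more positions; the names and the one-mismatch threshold are illustrative choices, not a description of a specific tool.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def correct_barcode(observed, whitelist, max_mismatches=1):
    """Return the unique whitelist barcode within max_mismatches of the
    observed barcode, or None if there is no match or the match is ambiguous."""
    hits = [known for known in whitelist
            if hamming_distance(observed, known) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


# Hypothetical whitelist: every pair of barcodes differs at all 6 positions,
# so a single sequencing error can never turn one barcode into another.
example_whitelist = ["AAAAAA", "CCCGGG", "TTTCCC"]

print(correct_barcode("AAAAAT", example_whitelist))  # one error: rescued
print(correct_barcode("AAAATT", example_whitelist))  # two errors: discarded
```

Reads whose barcode cannot be rescued are dropped in this sketch; a pipeline could instead use base qualities or barcode abundance to decide, which is beyond this illustration.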
  • 29:10The second part that is
  • 29:12important is the so-called
  • 29:14UMI. So while
  • 29:17the cell barcode is unique for each cell,
  • 29:20the UMI is unique for each
  • 29:23original molecule in your sample,
  • 29:25and that's because, in the
  • 29:27library preparation strategies,
  • 29:28this is an oligonucleotide
  • 29:31that is included, is appended to the
  • 29:35library during the cDNA
  • 29:39generation, before
  • 29:40the amplification steps,
  • 29:42so before the PCR.
  • 29:44So this means that
  • 29:47this stretch,
  • 29:50this random barcode, can be used to
  • 29:53discriminate between PCR duplicates
  • 29:54and real biological duplicates.
  • 29:57So in RNA-seq you can expect to see
  • 30:00two reads that are the same because
  • 30:03they were derived from two copies,
  • 30:06two different transcripts
  • 30:07transcribed from the same gene.
  • 30:09So some genes,
  • 30:11such as the transcripts
  • 30:12of ribosomal proteins,
  • 30:14are expected to be in the
  • 30:17range of 1,000 to
  • 30:1910,000 copies in a single cell,
  • 30:21so you can expect to have more
  • 30:24molecules captured in your library,
  • 30:26but these are true biological sequences
  • 30:29because at the origin you have two
  • 30:32distinct, different RNA molecules.
  • 30:34This is different from PCR
  • 30:37duplicates, because these are created
  • 30:40during the amplification step.
  • 30:42So that's why the
  • 30:44UMIs
  • 30:44are important, because this discrimination
  • 30:46between biological duplicates and
  • 30:48technical duplicates is really important
  • 30:50when you perform many
  • 30:51rounds of amplification,
  • 30:52and this happens when you
  • 30:54have a low input material,
  • 30:56such as in some tricky libraries
  • 30:58where you have a low amount
  • 31:01of sample, and this is the case in
  • 31:04single cell approaches because, again,
  • 31:06you're starting from the amount of RNA
  • 31:09that is extracted from one single cell.
  • 31:11So you can imagine having
  • 31:14a lot of amplification
  • 31:16occurring in order to detect the gene.
  • 31:19So this is the definition: the UMI
  • 31:22is a randomized nucleotide sequence.
  • 31:24Again, depending on the library preparation
  • 31:27and on the technique that you use,
  • 31:29it can be 8 nucleotides long or
  • 31:3112 nucleotides long; the longer,
  • 31:34the better.
  • 31:35It's incorporated into the cDNA in the
  • 31:37initial steps of the RNA-seq protocol,
  • 31:40so before the amplification step.
  • 31:42So the goal of the UMI is to
  • 31:44distinguish between amplified copies
  • 31:46of the same RNA molecule.
  • 31:48Because these have the same cDNA sequence
  • 31:51and they have the same
  • 31:53UMI,
  • 31:53they are technical duplicates
  • 31:55and they are removed.
  • 31:57While
  • 31:57what you want to keep is reads
  • 31:59from separate mRNA molecules
  • 32:00transcribed from the same gene,
  • 32:02because these will have the same
  • 32:04cDNA but will have a different
  • 32:06UMI,
  • 32:06so these are biological duplicates
  • 32:08and they are kept. So the
  • 32:10UMI is a method to reduce
  • 32:14the amplification noise.
  • 32:15This is a graphical example
  • 32:17of the importance,
  • 32:18so this is an example where
  • 32:19you have a reference sequence,
  • 32:21so this is a region of a gene,
  • 32:24for example, and in your experiment,
  • 32:26for example in the same cell
  • 32:27you get 10 reads
  • 32:28with identical sequence,
  • 32:30and so that they align to the same region.
  • 32:33So if we assume that they are all
  • 32:35PCR duplicates, we have to remove
  • 32:37all of them and keep only one.
  • 32:39And that means that when we calculate
  • 32:42the abundance of the gene, if we don't
  • 32:44remove the duplicates, we will say this
  • 32:47gene has a count of 10.
  • 32:50After deduplication we would say that
  • 32:52the gene has a count of one, because
  • 32:55by using this approach we assume that
  • 32:58all the duplicates are PCR duplicates.
  • 33:01If we include the UMIs,
  • 33:03we can separate technical
  • 33:05from biological duplicates.
  • 33:06So we can use the UMI (here different
  • 33:10UMIs are different colors) in
  • 33:12order to group technical duplicates,
  • 33:15for example,
  • 33:15these four, these three and these two,
  • 33:19but we keep the biological duplicates.
  • 33:21So in the end,
  • 33:23instead of collapsing everything,
  • 33:24we can keep four reads because,
  • 33:27having four different UMIs,
  • 33:29they probably correspond to four different
  • 33:31original molecules in our sample.
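
The slide example (10 identical reads carrying 4 different UMIs) can be reproduced with a minimal Python sketch. The cell name, gene name and UMI sequences below are made up for illustration, and `count_umis` is a hypothetical helper, not the function of any specific pipeline. It also assumes UMIs are sequenced without errors; real tools additionally merge UMIs that differ by a sequencing error.

```python
from collections import defaultdict


def count_umis(reads):
    """Count distinct UMIs per (cell, gene); reads are (cell, gene, umi) tuples."""
    umis = defaultdict(set)
    for cell, gene, umi in reads:
        # Reads with the same cell, gene and UMI are PCR (technical) duplicates,
        # so adding them to a set collapses them into a single molecule.
        umis[(cell, gene)].add(umi)
    return {key: len(values) for key, values in umis.items()}


# Ten reads aligned to the same region in the same cell, as in the example:
# groups of 4, 3 and 2 technical duplicates plus one read with its own UMI.
reads = (
    [("cell1", "geneA", "AACG")] * 4
    + [("cell1", "geneA", "TTGC")] * 3
    + [("cell1", "geneA", "GGAT")] * 2
    + [("cell1", "geneA", "CATG")]
)

raw_count = len(reads)                     # no deduplication
umi_count = count_umis(reads)[("cell1", "geneA")]  # UMI-based deduplication
```

Here the raw read count is 10, naive deduplication by sequence alone would give 1, and counting distinct UMIs gives 4.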
  • 33:39Is everything clear? Do you hear me?
  • 33:43I don't have interaction. Yes, yes.
  • 33:47Yes, OK.
  • 33:50OK, so that's why you use the cell
  • 33:53barcode to identify the cell and you use the
  • 33:55UMI to remove technical duplicates,
  • 33:57and that's why, instead of a count matrix,
  • 34:00instead of the number
  • 34:02of reads, you can also find in single cell
  • 34:04experiments the number of UMIs,
  • 34:06because basically what you are
  • 34:08doing is collapsing reads
  • 34:10mapping to
  • 34:11the same gene and with the same
  • 34:14UMI. And after you do all these
  • 34:18steps,
  • 34:20you can arrive at your count
  • 34:22matrix. In single cell
  • 34:23it's called digital expression matrix,
  • 34:25because it represents the number of reads
  • 34:29mapping to one gene in each of your cells.
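
As a rough sketch of what such a digital expression matrix looks like, the snippet below arranges per-cell, per-gene UMI counts into a genes-by-cells table. The input dictionary and the helper name `build_matrix` are hypothetical; real pipelines store this matrix in a sparse format because most entries are zero.

```python
def build_matrix(umi_counts):
    """Turn {(cell, gene): count} into (genes, cells, matrix) with genes as rows."""
    genes = sorted({gene for _, gene in umi_counts})
    cells = sorted({cell for cell, _ in umi_counts})
    # Any (cell, gene) pair that was never observed gets a count of zero.
    matrix = [[umi_counts.get((cell, gene), 0) for cell in cells] for gene in genes]
    return genes, cells, matrix


# Hypothetical deduplicated counts for two cells and two genes.
umi_counts = {("cell1", "geneA"): 4, ("cell2", "geneB"): 1}
genes, cells, matrix = build_matrix(umi_counts)
```

Each row is a gene, each column is a cell, and each entry is the number of distinct molecules counted for that gene in that cell.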
  • 34:32And all sequence data
  • 34:34are, as always and as in bulk RNA-seq,
  • 34:36in the FASTQ format.
  • 34:38So that's how you receive your sequences.
  • 34:41One of these steps,
  • 34:42in order to quantify the expression,
  • 34:44is that you align not the UMI and the
  • 34:47barcode, because those are only technical,
  • 34:50but you
  • 34:51align the read corresponding
  • 34:52to your cDNA.
  • 34:54The alignment tools for single cell
  • 34:56RNA-seq are almost all the
  • 34:59same as the ones used for bulk
  • 35:02RNA-seq.
  • 35:02So here again,
  • 35:04you see that there are multiple
  • 35:06options and multiple alignment
  • 35:07tools for different applications.
  • 35:09So for example,
  • 35:10here you see the years of publication.
  • 35:13It's not really updated.
  • 35:15The methods that you see in red
  • 35:17are the ones that were developed
  • 35:20specifically for RNA-seq,
  • 35:21and STAR, that you see here,
  • 35:24was released in 2012,
  • 35:27so almost ten years ago. It is one of the
  • 35:30standard alignment tools for bulk
  • 35:32RNA-seq; we saw this with Everett,
  • 35:35I think two weeks ago.
  • 35:37It's also the most common tool,
  • 35:40the default tool in many single
  • 35:42cell pipelines.
  • 35:43So almost all of them will use
  • 35:46STAR or HISAT or another
  • 35:48RNA-seq splice-aware
  • 35:50aligner in order
  • 35:51to perform the alignment,
  • 35:53so this is not so different
  • 35:56from bulk RNA-seq.
  • 35:57Also,
  • 35:58the alignment output that you
  • 36:00receive will be a BAM file.
  • 36:03This is a file where each original
  • 36:05read carries the information
  • 36:07on the alignment, so
  • 36:09this file contains the coordinates,
  • 36:11so the chromosome and the genomic
  • 36:13coordinates of the alignment,
  • 36:14and you need to use this file
  • 36:17in order to calculate the number
  • 36:19of reads that map to each gene
  • 36:21or each transcript.
  • 36:23So also this is not different
  • 36:25from bulk RNA-seq.
  • 36:27The only difference is that, for
  • 36:29example, you can have a different
  • 36:31BAM file for each cell
  • 36:34instead of only one BAM file.
  • 36:38OK, so what we covered so far is the
  • 36:41first data preprocessing, so again we
  • 36:43have the cell barcode, the UMI and the RNA.
  • 36:46You cluster your
  • 36:48reads according to the cell,
  • 36:51you remove technical duplicates and you
  • 36:53arrive at your gene expression matrix,
  • 36:55your digital expression matrix.
  • 36:56Now a big difference between the count
  • 36:59data that you can obtain in bulk versus
  • 37:01single cell is what you see here.
  • 37:03So this is a typical count matrix from
  • 37:06bulk RNA-seq and you can see the numbers
  • 37:09are very high and rarely you see zero values.
  • 37:13In single cell RNA-seq,
  • 37:15this is what you obtain most of the time.
  • 37:18I would say this is a very good one.
  • 37:20It is
  • 37:22an example with a low amount of zeros.
  • 37:25So what you can see is that the numbers are
  • 37:29lower and most of the values are zeros.
  • 37:32So the fact that you have lower counts
  • 37:35means that in all your analyses
  • 37:37you will have a higher contribution
  • 37:40of noise, and this will bring you
  • 37:42to a higher uncertainty in the results.
  • 37:45And this is a big problem:
  • 37:48it means that when you choose
  • 37:50among different pipelines,
  • 37:51you will have very different results.
  • 37:54But the origin of this is that
  • 37:59your original values,
  • 38:00your original quantification of expression
  • 38:02values, were generally very low, and so
  • 38:05the contribution of noise is higher.
  • 38:07So this problem is one of the main
  • 38:10problems in single cell RNA-seq and
  • 38:12at the moment is kind of unavoidable.
  • 38:15I would say the second problem
  • 38:18is that you have several zeros.
  • 38:21And some of these zeros are
  • 38:24real zeros, meaning that in that
  • 38:26cell the gene is not expressed,
  • 38:29so these correspond to true
  • 38:31biological zeros; they represent
  • 38:32the true lack of expression.
  • 38:34But many times the zeros represent
  • 38:36a technical lack of detection,
  • 38:38meaning that the gene was
  • 38:40present in your cell,
  • 38:41but it was not detected because
  • 38:44it was not captured by your bead,
  • 38:47and so you don't have a way to see
  • 38:49your gene because you didn't detect
  • 38:52it in your library preparation.
  • 38:54And obviously,
  • 38:55in methods where you have
  • 38:57low coverage inside each cell,
  • 39:00the probability of this technical
  • 39:02lack of detection,
  • 39:03also called the dropout
  • 39:05effect, is very high.
  • 39:07So I think in the 10X approaches
  • 39:10you can expect your genes to be
  • 39:13not detected with 80% probability.
  • 39:16Obviously this depends also on
  • 39:18whether the gene is highly expressed
  • 39:20or has a low expression level.
  • 39:22So if a gene has a high expression level,
  • 39:25the probability of being detected with at least
  • 39:27one molecule is higher,
  • 39:29but genes with low expression levels,
  • 39:31for example transcription factors,
  • 39:32will rarely be detected,
  • 39:34and most of the time it will be
  • 39:36because of a detection problem,
  • 39:38not because they are not expressed
  • 39:40in the cell.
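
The dependence of dropout on expression level can be illustrated with a very simple model. This is an assumption made only for illustration, not something stated in the lecture: each of the n molecules of a gene in a cell is captured independently with the same probability p, so the gene is detected with probability 1 - (1 - p)^n. The capture efficiency of 0.1 used below is a made-up value.

```python
def detection_probability(n_molecules, capture_efficiency):
    """P(at least one of n molecules is captured), assuming independent capture."""
    return 1.0 - (1.0 - capture_efficiency) ** n_molecules


# Hypothetical capture efficiency of 10% per molecule.
p = 0.1
low_expression = detection_probability(2, p)      # e.g. a transcription factor
high_expression = detection_probability(1000, p)  # e.g. a ribosomal protein gene
```

Under these assumptions a gene with two molecules per cell is missed in most cells, while a gene with a thousand molecules is essentially always detected, which matches the qualitative point above.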
  • 39:41And this is also an inherent problem
  • 39:43of single cell data analysis.
  • 39:45So it's very important to.
  • 39:48Doodle
  • 39:51uh no.
  • 39:55No, yes, so now we wouldn't be
  • 39:58ready by the end of the time.
  • 40:01So one thing we could do is I could
  • 40:04continue and finish the next time
  • 40:09with the remainder of the analysis
  • 40:11steps. Sure, yeah, I think, Tom,
  • 40:12it's your own judgment
  • 40:14how you want to proceed.
  • 40:15Like, if you think it is a natural stop,
  • 40:17then you can stop; if you think
  • 40:19you want to cover 5 more minutes,
  • 40:21go ahead and do that.
  • 40:24I can give you a sort of anticipation
  • 40:27of the following steps.
  • 40:29So, many of these steps,
  • 40:33I mean, for many of these steps I
  • 40:36took inspiration from this review that
  • 40:39was recently published in Nature Methods.
  • 40:42So it covers the main trials,
  • 40:44so the successes and also the
  • 40:47limitations of the computational
  • 40:49methods for single cell
  • 40:51RNA-seq analysis, and so next time we'll
  • 40:54cover the key preprocessing steps.
  • 40:56We have seen the molecular counting,
  • 40:59but we will see
  • 41:01how we can do quality control:
  • 41:04remove cells that are suspicious,
  • 41:06for example because they are dying or
  • 41:09because they represent empty droplets
  • 41:12or because they represent doublets.
  • 41:14So doublets occur when you didn't really
  • 41:17manage to physically separate the cells,
  • 41:19so for example in the
  • 41:22same droplet you have two cells,
  • 41:24or for some reason the cell barcode
  • 41:27was shared by two different cells
  • 41:30because of some technical problem.
  • 41:32So we'll see methods to remove
  • 41:34dying cells and doublets. Then there
  • 41:37are problems related to the normalization,
  • 41:40so how to account for
  • 41:42the fact that you have a different
  • 41:45number of reads in
  • 41:47different cells. And this is a problem
  • 41:50because, biologically speaking, you
  • 41:52expect cells of different types to
  • 41:55have a different amount of RNA,
  • 41:58so you expect some cells to have more
  • 42:00RNA molecules than other cells,
  • 42:03but most of the methods
  • 42:05assume that you need to
  • 42:08have from each cell the same number of
  • 42:11reads or UMIs. And then we will see
  • 42:14how to remove genes that are
  • 42:17not important in the analysis.
  • 42:19This is very important because, as
  • 42:20you can imagine, in single cell you
  • 42:23have thousands of genes and also
  • 42:25thousands of cells,
  • 42:26so your count matrix is very high
  • 42:29dimensional, and so all of these methods
  • 42:31try to reduce the number of cells,
  • 42:33keeping only the high quality
  • 42:35cells, but also the number of genes,
  • 42:38so reducing the dimensionality
  • 42:39of your data.
  • 42:46Leash.
  • 42:48That sounds amazing. I look forward to
  • 42:51next week. Yeah. And also next week
  • 42:53we will see the downstream analysis,
  • 42:55for example clustering for single
  • 42:58cell approaches and also trajectory,
  • 43:01possibly also trajectory estimation.
  • 43:05Yeah, thanks so much, and another thing I really
  • 43:07want to comment on is that your slides
  • 43:09look very beautiful, and I wish all my
  • 43:12slides, and possibly other people's
  • 43:14as well, looked as nice. When
  • 43:17I try and try to put a course
  • 43:19on this, so it's kind of a,
  • 43:21maybe we should use your slides as template.