
MAPS_GSTH_part_I

June 18, 2024
  • 00:00It is my pleasure to introduce
  • 00:03Dhananjay Bhaskar, Jay.
  • 00:06He is a postdoctoral researcher
  • 00:08in the Department of Genetics
  • 00:10at Yale School of Medicine.
  • 00:12He has a strong quantitative
  • 00:15background in mathematical
  • 00:18modeling, machine learning,
  • 00:20and topological data analysis,
  • 00:22with applications in biophysics
  • 00:24and biomedical research.
  • 00:26He received his PhD in Biomedical
  • 00:28Engineering and his Master's degree
  • 00:31in Data Science from Brown University,
  • 00:34and before that he studied computer
  • 00:36science and applied mathematics
  • 00:38at the University of British Columbia.
  • 00:40This is going to be a four-part series and
  • 00:44we have a group of presenters who I will
  • 00:48introduce for each section separately.
  • 00:52And Jay, I'm going to give it to you.
  • 00:54So if you want to add anything about
  • 00:56overall research, you're welcome to.
  • 00:58And if anybody
  • 01:00has questions,
  • 01:01you're welcome to type them in the
  • 01:04chat or in question and answers.
  • 01:06And Jay said that he will respond
  • 01:08to them as they come.
  • 01:10And thank you, Helen,
  • 01:12for the very kind introduction.
  • 01:14And welcome, everyone,
  • 01:16to the first workshop in this series.
  • 01:19Today, my goal is to introduce you
  • 01:23to a methodology called topological
  • 01:26data analysis and machine learning.
  • 01:29Both of these techniques are,
  • 01:32you know, very broad.
  • 01:33They encompass a number of different
  • 01:36methods and so there's no way I can
  • 01:38cover both of them in any amount of
  • 01:41detail in just a single session.
  • 01:43My goal is to give you a broad
  • 01:46overview of these techniques and
  • 01:48to build some intuition and for
  • 01:52us to share a common vocabulary.
  • 01:55And then in the subsequent workshops,
  • 01:57we are going to take
  • 01:59bits and pieces of topological
  • 02:02data analysis, TDA in short,
  • 02:04and machine learning that we need
  • 02:08to analyze some neuroimaging data.
  • 02:11And as Helen mentioned, my name is Dhananjay.
  • 02:14I generally go by Jay.
  • 02:16I wear many different hats,
  • 02:18but most relevant is I'm a
  • 02:22postdoctoral fellow in neuroscience.
  • 02:24I also work, I wanted to disclose,
  • 02:28with Boehringer Ingelheim,
  • 02:29which is a German pharmaceutical company,
  • 02:32and I still maintain some
  • 02:34affiliations with Brown,
  • 02:35having completed my PhD there.
  • 02:38So first, I wanted to
  • 02:41give you a little bit of
  • 02:43an introduction to myself.
  • 02:43So as Helen mentioned,
  • 02:45I come from a very quantitative
  • 02:48dry lab background.
  • 02:50I received my undergraduate degree in
  • 02:52computer science and math and did a
  • 02:55master's degree in applied mathematics.
  • 02:57And in those years, a long,
  • 03:00long time ago,
  • 03:01I was interested in modelling
  • 03:04biophysics and in particular,
  • 03:06I was interested in developmental
  • 03:08and cancer biology.
  • 03:09So I spent a lot of my formative
  • 03:12years putting together agent based
  • 03:15models to simulate cell migration
  • 03:18and cell morphology and emergence of
  • 03:21different types of migratory patterns
  • 03:24in normal healthy tissue and also
  • 03:28various kinds of tumors and you know,
  • 03:32across development and embryogenesis.
  • 03:35Subsequently,
  • 03:36I moved over to Brown University
  • 03:40where I was in the data science and
  • 03:44biomedical engineering departments.
  • 03:45And it was at Brown University
  • 03:48where I became fascinated with some
  • 03:51mathematical concepts to do with shape.
  • 03:54And I realized during these years
  • 03:58that learning the shape of the data
  • 04:01and being able to quantify the shape
  • 04:04of the data can be a really powerful
  • 04:08tool for biomedical data analysis.
  • 04:11And so for instance,
  • 04:12if you have a bunch of data
  • 04:15points that are arranged in this
  • 04:17weird double torus-like shape,
  • 04:19being able to actually take these
  • 04:22individual data points and be
  • 04:24able to fill in the empty spaces.
  • 04:27And to be able to recognize that
  • 04:29there are two big holes in the data
  • 04:31and that our data has a loop like
  • 04:34structure where there's a bigger loop
  • 04:36around which our data points are organized.
  • 04:39And then there are smaller loops
  • 04:41that surround those bigger loops.
  • 04:43Being able to recognize these types of
  • 04:46patterns can be extremely powerful,
  • 04:48especially when we are dealing
  • 04:50with biological data.
  • 04:51And then recently I moved to a postdoc
  • 04:56in genetics and also, I guess,
  • 04:59in computer science
  • 05:00at Yale University.
  • 05:02And I spent my postdoc thinking a lot
  • 05:05about data that is structured like a graph.
  • 05:08And so what I mean by a graph
  • 05:11here is a set of nodes and edges.
  • 05:14So the nodes being represented as the
  • 05:16circles and edges being the lines that
  • 05:19are connecting different nodes together.
  • 05:21There's lots and lots of data out there
  • 05:24that can be represented in this format.
  • 05:27For example, if you are looking
  • 05:29to do drug discovery,
  • 05:31you can represent molecules using
  • 05:34nodes and edges corresponding
  • 05:37to atoms and bonds, respectively.
  • 05:40You can take protein sequences,
  • 05:41fold them with AlphaFold and then
  • 05:44represent protein structure in this manner.
  • 05:46But also you can take neuroscience
  • 05:50data such as brain imaging data and
  • 05:54divide the brain into different parcels
  • 05:56and learn to represent brain imaging
  • 05:59data in this format where the nodes
  • 06:02are going to represent different
  • 06:03parcels or regions of the brain.
  • 06:06And the edges could be anatomical
  • 06:09connectivity between those
  • 06:10parcels in the brain,
  • 06:12or they could be functional connectivity
  • 06:14between those parcels in the brain.
  • 06:16Maybe at an even higher level,
  • 06:18one can think of like taking biomedical
  • 06:21data in general and representing it
  • 06:23in the format of a knowledge graph,
  • 06:26where you can bring in data from,
  • 06:28you know,
  • 06:30publications from single cell sequencing
  • 06:34experiments and other modalities
  • 06:35and represent all of that data in
  • 06:38a large biomedical knowledge graph.
  • 06:40So during my postdoc,
  • 06:42I've developed techniques to not
  • 06:44only represent data as graphs,
  • 06:46but also to develop machine learning
  • 06:49techniques to learn to reason about
  • 06:52these kinds of graphs and represent
  • 06:54them in a way that a computer can
  • 06:56understand the structure of the
  • 06:58graph and take advantage of that
  • 07:00to answer all kinds of questions.
  • 07:02And so today I'm going to talk
  • 07:05to you about a technique that is
  • 07:09utilizing this graph structure and
  • 07:11combining it with aspects of topology,
  • 07:15which is essentially the technique that
  • 07:17allows us to recognize the shape of our data.
  • 07:20And so to motivate this,
  • 07:23this technique that I'm going to
  • 07:24talk about and we're going to
  • 07:26develop over the next few workshops,
  • 07:28I wanted to share with you some
  • 07:30time lapse microscopy images that
  • 07:33were taken a long time ago.
  • 07:36And these are calcium imaging data
  • 07:39sets of a developing zebrafish embryo.
  • 07:43And if I could play these,
  • 07:46not sure if I can hang on a second,
  • 07:52OK, if I play these images,
  • 07:55What you'll notice here is that on the left,
  • 07:58you have a zebrafish embryo
  • 08:00that's early in its development.
  • 08:02In the middle, it's grown a little bit more.
  • 08:05And on the right,
  • 08:06the zebrafish embryo is much
  • 08:07further along in its development.
  • 08:09What you're going to notice is
  • 08:11that the signalling patterns,
  • 08:12the calcium signalling patterns across
  • 08:15development look very different.
  • 08:17In the video on the left,
  • 08:19early in development we see that
  • 08:22we see individual spiking events,
  • 08:24so individual calcium signaling
  • 08:26events that are not really
  • 08:28correlated temporally or spatially.
  • 08:31A little bit further along in development,
  • 08:34we start to see patches of
  • 08:36synchronous activity in the embryo.
  • 08:39So you see these small patches,
  • 08:40but they don't really travel very far.
  • 08:43And even later in development you
  • 08:45start to see these waves traveling
  • 08:48wave like patterns where you have
  • 08:50calcium signaling starting at a
  • 08:52small group of cells and that
  • 08:54really kind of expands and goes all
  • 08:56across the embryo of the zebrafish.
  • 08:58And even today,
  • 09:00although we have really nice
  • 09:02techniques for being able to capture
  • 09:05this kind of imaging data,
  • 09:07we don't really have good quantitative
  • 09:11tools to be able to analyze the
  • 09:15spatiotemporal patterns that
  • 09:17we see in these videos.
  • 09:19Likewise for brain imaging,
  • 09:21we have really well developed
  • 09:23tools: fMRI, NIRS, EEG,
  • 09:26all kinds of tools to image the brain.
  • 09:33And we see here, for example, in
  • 09:37this example, that the brain activity
  • 09:39patterns that we get in a healthy
  • 09:42typically developed human and an
  • 09:44individual who's suffering from Alzheimer's,
  • 09:47they're very different.
  • 09:48We don't really have a good
  • 09:52tool set to be able to analyze
  • 09:54the spatiotemporal dynamics that
  • 09:57we are seeing across the brain.
  • 09:59So this problem of quantifying
  • 10:03dynamics both spatially and temporally,
  • 10:06this exists not just at a cellular
  • 10:09and tissue scale in biology,
  • 10:12but also at a systems and organ
  • 10:15scale in neuroscience.
  • 10:16And this is something that
  • 10:18we wish to address.
  • 10:20And so what are some of the challenges
  • 10:23in these data sets and how do we go
  • 10:26from these noisy high dimensional
  • 10:29neuroimaging data sets to neural insights?
  • 10:32And what I mean by neural insights
  • 10:35here is really figuring out
  • 10:37patterns of activity both spatially
  • 10:40and temporally in the brain that
  • 10:43correspond to various kinds of stimuli,
  • 10:46various kinds of diseases,
  • 10:49and various kinds of tasks.
  • 10:52And so ideally what we want
  • 10:54to be able to do
  • 10:55is to build a network that can take
  • 10:59in patterns of brain activity and say
  • 11:02that this pattern of brain activity
  • 11:05corresponds to somebody who's maybe
  • 11:08clicking their right thumb like this.
  • 11:12And so the challenge
  • 11:14is enormous because when we look
  • 11:17at this kind of data,
  • 11:19and this is again a brain imaging
  • 11:21data set here,
  • 11:22if you visualize the data,
  • 11:23we see that there is a lot of noise
  • 11:27in this data set.
  • 11:29If we take just one voxel of
  • 11:31this brain imaging
  • 11:32data set and we visualize it over time,
  • 11:35we see that we don't really see this nice
  • 11:38clean line that we would like to see.
  • 11:40In fact, we see that the
  • 11:41data is all over the place.
  • 11:43So we have to learn to be
  • 11:45able to denoise this data set.
  • 11:49The second thing we want
  • 11:51to do is we want to learn
  • 11:54salient features of the data set.
  • 11:56So in these neuroimaging data sets and also
  • 11:59in calcium imaging and other data sets,
  • 12:02not all features of the
  • 12:05image are equally important.
  • 12:07There are some features of the image
  • 12:09that are salient to the task at hand,
  • 12:11whether it's to diagnose individuals
  • 12:13or to learn what kind of stimulus
  • 12:17they're experiencing or to learn
  • 12:20how to decode their brain activity into
  • 12:23whatever stimulus that they experienced.
  • 12:26So distilling the state space of the
  • 12:29brain and learning salient features
  • 12:31of this data set is very important.
  • 12:35And finally, in neuroimaging in particular,
  • 12:39we are always challenged by spatial
  • 12:42versus temporal resolution.
  • 12:44So we have techniques such as
  • 12:47EEG which have very,
  • 12:49very good temporal resolution but
  • 12:52have very poor spatial resolution.
  • 12:55On the other hand,
  • 12:57we have techniques such as fMRI where
  • 12:59the spatial resolution is amazing.
  • 13:01We get thousands and thousands
  • 13:03of voxels across the brain,
  • 13:05but the temporal resolution of fMRI at
  • 13:09around 0.5 Hz is very low compared to EEG.
  • 13:13So we want to develop techniques
  • 13:16that can bridge the gap
  • 13:18between high spatial resolution
  • 13:20and high temporal resolution.
  • 13:22And we want to develop techniques
  • 13:25that can perhaps integrate multiple
  • 13:27modalities of data together so we can
  • 13:30benefit from both high spatial resolution
  • 13:33and also high temporal resolution.
  • 13:38So those were just some of the motivating
  • 13:41factors that in our lab led to the
  • 13:44development of a technique called GSTH.
  • 13:48GSTH stands for Geometric
  • 13:51Scattering Trajectory Homology.
  • 13:53And it's a bit of a mouthful.
  • 13:55And over the next two or three workshops,
  • 13:59we are going to go into all the
  • 14:02components that form this methodology.
  • 14:05And so today, just to begin with,
  • 14:07I'll just give you a very short
  • 14:10introduction to
  • 14:11how this methodology works.
  • 14:14And so in this method,
  • 14:16we start by creating a graph from our data.
  • 14:20If we are dealing with
  • 14:22some calcium imaging data,
  • 14:24like imagine you are imaging
  • 14:27calcium from the primary visual
  • 14:29cortex of a mouse and you're
  • 14:31maybe imaging in like layer 4.
  • 14:33Let's say you're going to
  • 14:36get a sequence of images.
  • 14:38And what you can do is you can use
  • 14:41existing tools to segment those images
  • 14:44so you know where the cells are located.
  • 14:47And then you can build a graph.
  • 14:49And by graph again,
  • 14:51I mean nodes and edges by using the
  • 14:54centroids of all the cells as nodes in
  • 14:57the graph and putting an edge between
  • 15:00any pair of cells that share a boundary.
  • 15:03So any two cells that are adjacent
  • 15:05to each other will be two nodes
  • 15:08connected by an edge in the graph.
  • 15:10Similarly,
  • 15:11if you have some neuroimaging data set
  • 15:14that you're looking to analyze with GSTH,
  • 15:18what you can do is you can take the
  • 15:21brain and you can convert it into
  • 15:24parcels using your favorite Atlas.
  • 15:26And so those individual parcels of the
  • 15:29brain will form the nodes in the graph.
  • 15:33And we are going to put an edge between
  • 15:36any pair of parcels that are anatomically
  • 15:39close to each other in the brain.
  • 15:42So we start with a graph construction.
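As a rough sketch of this graph-construction step (a hedged illustration, not the GSTH implementation: the proximity-threshold rule, function name, and parameter values below are all stand-ins for the adjacency criteria described above):

```python
import numpy as np
import networkx as nx
from scipy.spatial.distance import pdist, squareform

def build_proximity_graph(centroids, threshold):
    """Nodes are cell or parcel centroids; edges connect pairs closer than `threshold`."""
    dist = squareform(pdist(centroids))  # pairwise Euclidean distances
    graph = nx.Graph()
    graph.add_nodes_from(range(len(centroids)))
    # connect every pair of centroids within the distance threshold
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            if dist[i, j] < threshold:
                graph.add_edge(i, j)
    return graph

rng = np.random.default_rng(0)
centroids = rng.uniform(0, 100, size=(100, 2))  # placeholder cell centroids
G = build_proximity_graph(centroids, threshold=15.0)
```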
  • 15:45Now each node in the graph will have
  • 15:48a signal assigned to it and the signal
  • 15:51is going to be a time varying signal.
  • 15:54In the case of calcium imaging,
  • 15:57for example,
  • 15:57we are going to have as our signal
  • 16:01the calcium activity over time.
  • 16:04In the case of
  • 16:06neuroimaging data sets,
  • 16:07we are going to have averaged voxel
  • 16:11activations within each parcel as
  • 16:14our time lapse signal on the graph.
  • 16:18And so in GSTH,
  • 16:20what we do is we take that graph
  • 16:22and we use some techniques in graph
  • 16:25signal processing to convert the time
  • 16:28lapse signal on the graph into some
  • 16:31kind of numerical representation.
  • 16:33So think of like taking this
  • 16:35time lapse signal on the graph
  • 16:37and coming up with a vector,
  • 16:39which is nothing but a sequence
  • 16:42of numbers that represent how that
  • 16:45signal is distributed in the graph.
  • 16:49And we're going to cover how this
  • 16:52graph signal processing happens
  • 16:53in the next workshop.
  • 16:55But assuming you can do that,
  • 16:57the next step in our methodology
  • 17:00is to construct a trajectory of
  • 17:05the dynamics using some nonlinear
  • 17:08dimensionality reduction techniques.
  • 17:10And again,
  • 17:10this is something that we will cover
  • 17:12in detail in subsequent workshops.
  • 17:14But what's happening here is that
  • 17:17we are representing the time
  • 17:19lapse data that we started with
  • 17:21through a low-dimensional trajectory.
  • 17:24So in this case, I'm showing you a 3D
  • 17:28trajectory and it's colored by time.
  • 17:30And so we are saying that we start
  • 17:32over here and we kind of move around,
  • 17:35we go in a circle and we end up
  • 17:38in this region of the space.
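As a hedged sketch of this trajectory step: one feature vector per time point goes into a nonlinear dimensionality reduction, and the embedded points, ordered by time, trace out the trajectory. Isomap is used below purely as a stand-in (the specific method used in GSTH is covered in a later workshop), and the array shapes are placeholders:

```python
import numpy as np
from sklearn.manifold import Isomap

rng = np.random.default_rng(0)
features = rng.random((200, 64))  # placeholder: 200 time points x 64 graph features

# embed each time point into 3D; embedding[t] is the state at time t, and
# plotting the rows in order, colored by t, gives a trajectory like the one shown
embedding = Isomap(n_components=3).fit_transform(features)
```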
  • 17:40And so recall how I talked to you
  • 17:43earlier about denoising and learning
  • 17:45the state space as being important
  • 17:49challenges in neuroscience.
  • 17:51Well, this graph signal processing
  • 17:54in Step 2 effectively denoises
  • 17:57the data set that we started with.
  • 18:00And these trajectories,
  • 18:02these low dimensional trajectories allow
  • 18:05us to quantify where in state space we are.
  • 18:10In particular,
  • 18:11what I want to emphasize is that
  • 18:14within these trajectories,
  • 18:16anytime you get a loop in the trajectory,
  • 18:20that means that your underlying
  • 18:22signaling pattern has some kind
  • 18:25of periodicity attached to it.
  • 18:28Because a loop structure in this
  • 18:31low-dimensional space indicates that we end
  • 18:33up at the same state or close to the
  • 18:37same state where we started from.
  • 18:39So these trajectories are really quite
  • 18:42informative and we can interpret the
  • 18:45shape of these trajectories by looking
  • 18:48at looking at our data through the
  • 18:50lens of periodicity and quasi periodicity.
  • 18:54So we recognize that the shape of
  • 18:56these trajectories is very important.
  • 18:58And in order to be able to compare
  • 19:01across different data sets and across
  • 19:03different subjects in an experiment,
  • 19:05we need to find a way of quantifying
  • 19:09the shape of the trajectory.
  • 19:11And to quantify the shape of the trajectory,
  • 19:14we use topological data
  • 19:17analysis as our main tool.
  • 19:19And so topological data analysis is a
  • 19:21technique that I'm going to cover today,
  • 19:24which takes point cloud data.
  • 19:27What I mean by point cloud data is just a
  • 19:29bunch of points sitting in some dimension.
  • 19:32In this case,
  • 19:32these points are all in 3-dimensional
  • 19:35space, and it converts them into
  • 19:38something called a persistence diagram.
  • 19:40And this persistence diagram quantifies
  • 19:43how connected those points are.
  • 19:46And it also quantifies the shape of
  • 19:49this data in the sense that it measures
  • 19:53how loopy the trajectory is and whether
  • 19:56or not that trajectory has any holes in it.
  • 19:59So again,
  • 20:00this might sound very abstract at this stage,
  • 20:02but this is a technique that I'm going
  • 20:04to talk about in more detail today,
  • 20:06topological data analysis.
  • 20:07And what we do then is we can take
  • 20:10these topological features that are
  • 20:13capturing the shape of our trajectory,
  • 20:15and we can put them through some
  • 20:18machine learning in order to be able
  • 20:20to use GSTH as a diagnostic tool,
  • 20:23for example.
  • 20:24So what machine learning will do is it
  • 20:26will take these topological features
  • 20:29and classify whether or not
  • 20:32the individual that we are looking at
  • 20:35is a typically developed individual
  • 20:37or whether they have schizophrenia
  • 20:40or they have OCD or Alzheimer's.
  • 20:43What you can also do is you can
  • 20:45use this technique to figure out
  • 20:47whether the brain
  • 20:49is in the resting state or
  • 20:51is engaged in some task.
  • 20:52You can learn to figure out what task
  • 20:55an individual is doing by quantifying
  • 20:58the shape of these trajectories.
  • 21:00And there are many,
  • 21:01many other application areas that I'm
  • 21:04sure you can think of applying this to.
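As a toy sketch of this last step, assuming each subject has already been reduced to a small vector of topological features; the feature count, labels, and classifier choice below are illustrative only, not the pipeline's actual configuration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((40, 4))     # placeholder: 40 subjects x 4 topological features
y = rng.integers(0, 2, 40)  # placeholder labels, e.g. control vs. patient

# cross-validated accuracy of a simple linear classifier on those features
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores.mean())
```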
  • 21:07So I just want to go over the
  • 21:10workshop organization briefly.
  • 21:12So we have two other fantastic speakers
  • 21:16for our workshop, Rahul Singh,
  • 21:19he's in the audience today and Brian
  • 21:22Zabowski is also in the audience today.
  • 21:24Rahul is a Wu Tsai postdoctoral fellow.
  • 21:27He will be talking to you next week
  • 21:30and he'll be talking about graph
  • 21:33signal processing methods that form
  • 21:35the second step of our methodology.
  • 21:37And then the following week,
  • 21:39Brian and I
  • 21:41will jointly present to you the
  • 21:43entirety of the GSTH technique and we'll
  • 21:47share with you several applications
  • 21:49of GSTH both for cellular imaging data
  • 21:53sets and also for neuroimaging data sets.
  • 21:56And then of course, Helen will be around
  • 21:59to facilitate all of these workshops.
  • 22:01She's really the brains behind the operation.
  • 22:04And so we have the first three workshops to cover
  • 22:08different aspects of the GSTH methodology.
  • 22:11We are starting from the end.
  • 22:13So I'm going to talk about topological
  • 22:15data analysis and machine learning today.
  • 22:17Rahul will then talk about
  • 22:19graph signal processing.
  • 22:20In the third workshop,
  • 22:22we'll bring these things together
  • 22:24and go over the complete GSTH
  • 22:27methodology and its applications.
  • 22:29And then the final week of the workshop,
  • 22:31we are going to do a hands-on
  • 22:34tutorial where you'll get to load
  • 22:37a neuroimaging data set and
  • 22:40also a cellular imaging data set and
  • 22:43analyze them using GSTH in Python.
  • 22:45And at the moment I think
  • 22:47we're planning to hold
  • 22:48our fourth workshop as
  • 22:50a hybrid workshop
  • 22:52that might have an in-person component.
  • 22:55So we'll get back to you on that and
  • 22:58the location for that in subsequent weeks.
  • 23:01Yes, I will send a series of
  • 23:03emails where people can sign up
  • 23:06for the in-person component. Great.
  • 23:10All right. So we have a few live participants.
  • 23:13I understand that the majority of these
  • 23:16workshops get viewed online over a period
  • 23:18of like weeks and months and years.
  • 23:21So please feel free to stop me
  • 23:23anytime and to ask questions.
  • 23:25And so as I mentioned, I'm going to start
  • 23:29with topological data analysis.
  • 23:31And depending on how much time I
  • 23:33have available to me,
  • 23:35I will also cover some fundamentals
  • 23:37of machine learning just to make sure
  • 23:40that everybody is on the same page
  • 23:42and we all share the same vocabulary
  • 23:44in the weeks going forward.
  • 23:46So let's start with TDA.
  • 23:49And I wanted to start by just showing
  • 23:51you some of these point cloud examples.
  • 23:54And so when I look at these data sets,
  • 23:56what I see is that maybe in the first
  • 23:59data set, we have two variables.
  • 24:01Maybe we have an independent variable
  • 24:04and a dependent variable that are
  • 24:06strongly correlated together.
  • 24:08And this to me looks kind of like
  • 24:10a linear correlation,
  • 24:11like a regression type of data set.
  • 24:14When I look at the second data set here,
  • 24:17what I'm recognizing is that
  • 24:19the data set is clustered.
  • 24:21We have a bunch of points that are
  • 24:24grouped together and we have kind of
  • 24:27three clusters of data in this data set.
  • 24:29The third data set,
  • 24:31to me looks cyclical.
  • 24:33I can spot a circle in this data set,
  • 24:37and that might indicate that perhaps
  • 24:39this is some time lapse data set.
  • 24:41Maybe there's some kind of oscillatory
  • 24:43nature to this data set,
  • 24:45and maybe we're going around in
  • 24:47circles and the last data set
  • 24:49here has this kind of Y shape.
  • 24:52It looks like it's kind of branching out.
  • 24:55This could be maybe some stem cells
  • 24:57down here that are, you know,
  • 24:59differentiating into two different lineages.
  • 25:02It seems to have this tree like
  • 25:04hyperbolic structure to it.
  • 25:06And so our brains are really,
  • 25:08really good at recognizing the
  • 25:11shape of the data,
  • 25:13especially when the data is presented
  • 25:15to us in these low dimensions.
  • 25:17And we understand fundamentally
  • 25:19that any data that we have,
  • 25:22that data has some shape,
  • 25:24and the shape carries some meaning.
  • 25:27And this really is the central
  • 25:30tenet of topological data analysis,
  • 25:33which is a branch of applied
  • 25:36mathematics and computer science
  • 25:38that has to do with understanding
  • 25:41fundamentally the shape of our data.
  • 25:44And underlying all of this is what
  • 25:47we call the manifold hypothesis.
  • 25:50The idea being that any scientific
  • 25:53data that we collect in our lab
  • 25:57might look very noisy and it
  • 25:59might be very high dimensional.
  • 26:01But quite often that scientific data
  • 26:04is sampled from some low-dimensional
  • 26:07manifold. And what we are really after
  • 26:10is to understand what that manifold
  • 26:13looks like and what the intrinsic
  • 26:16dimension of that manifold is.
  • 26:19So in this example here our manifold
  • 26:21looks to be kind of saddle shaped and
  • 26:25it has these two curvature areas.
  • 26:27So it has a direction of positive curvature,
  • 26:30a direction of negative curvature,
  • 26:32and our data is simply
  • 26:34sampled from this manifold.
  • 26:36So what we really want to understand
  • 26:38is the shape of the manifold.
  • 26:40Another way to look at this is
  • 26:42what we get in our experiments
  • 26:45are individual data points,
  • 26:48and those data points all
  • 26:50together form some kind of shape.
  • 26:52And what we really want to see
  • 26:54is what that shape looks like.
  • 26:56So in this case,
  • 26:57all these data points form a
  • 26:59torus and this is kind of,
  • 27:01this is the realization that we
  • 27:03are going to come to is that
  • 27:05our data is arranged in the
  • 27:07shape of a doughnut or a Taurus.
  • 27:09So how do we actually go about doing that?
  • 27:12Let me share with you the methodology
  • 27:16using some very simple data sets that
  • 27:19are easy to plot in A2 dimensional slide.
  • 27:22And so we'll be working with these two
  • 27:25data sets for the next few slides.
  • 27:27The data set on the left,
  • 27:29I'm going to call the concentric
  • 27:31circles data set, and that's
  • 27:34simply in recognition of the fact
  • 27:36that these points are sampled
  • 27:38from 2 circles where one circle
  • 27:41is within another circle.
  • 27:42And the data set on the right,
  • 27:45I'm going to call the half moons data
  • 27:48set simply because both of these,
  • 27:51we have kind of two arcs in our
  • 27:53data and they both look like kind
  • 27:55of half moons or Crescent moons.
  • 27:58And So what we want to do is we
  • 28:00want to use a technique to recognize
  • 28:02the fact that our data on the left
  • 28:05is arranged in two circles.
  • 28:07And the data on the right,
  • 28:08it looks kind of circular,
  • 28:10but it's not really two circles
  • 28:14or one circle for that matter.
  • 28:16And so one thing you might want
  • 28:17to do is
  • 28:18consider using a
  • 28:20clustering method to see if that works,
  • 28:22right?
  • 28:22So you could take those data points
  • 28:25and throw them into an algorithm,
  • 28:27maybe something similar to k-means.
  • 28:29And you might see like,
  • 28:30OK, does the data cluster?
  • 28:32Well,
  • 28:32if you run this data set through k-means,
  • 28:35you'll end up with these clusters,
  • 28:37the blue cluster and the orange cluster.
  • 28:40And these two clusters don't really
  • 28:42tell you the true story behind the data.
  • 28:45In particular,
  • 28:46they don't recognize the fact that
  • 28:48these data are arranged in two circles.
  • 28:50And we even get some misclustering
  • 28:53happening in the data set on the right.
  • 28:56Now you might then go back and say that,
  • 28:57OK,
  • 28:58I should use a different sort
  • 29:00of clustering technique.
  • 29:02Maybe I can cluster the data by its density.
  • 29:05And so when you employ a density based
  • 29:08clustering methods such as DB scan,
  • 29:10you do indeed get the correct
  • 29:13cluster labels for your data.
  • 29:15You are able to separate data
  • 29:16points in the inner circle from
  • 29:18data points in the outer circle,
  • 29:20and you are able to separate the
  • 29:22data points belonging to the upper
  • 29:24crescent moon and the lower crescent moon.
  • 29:27Even then, the machine doesn't really know
  • 29:30that the data is arranged as circles.
  • 29:33It has no recognition of that.
  • 29:35It has simply learned that your data
  • 29:37is clustered into these two groups,
  • 29:39but it doesn't fundamentally understand.
  • 29:42What we can tell immediately is that this
  • 29:45data is arranged in a circular pattern.
  • 29:48And so this is where topology comes in.
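This comparison is easy to reproduce with scikit-learn's synthetic versions of these two data sets; a minimal sketch (the parameter values are illustrative):

```python
from sklearn.datasets import make_circles, make_moons
from sklearn.cluster import KMeans, DBSCAN

# synthetic versions of the two examples above
X_circles, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=0)
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# k-means splits the circles into two half-plane blobs (wrong), while
# density-based DBSCAN recovers the two rings (right); neither, however,
# tells us that the data is arranged as circles.
kmeans_labels = KMeans(n_clusters=2, n_init=10).fit_predict(X_circles)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_circles)
```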
  • 29:52And so I'm going to talk
  • 29:54to you about topology.
  • 29:55And because I'm a very visual learner,
  • 29:58I'm going to use some animations and
  • 30:00some figures to kind of demonstrate how
  • 30:03topology works without necessarily going
  • 30:06into all the math and all the code behind it.
  • 30:09We'll get to use some of
  • 30:12the code in our third workshop.
  • 30:14But honestly, like the code is
  • 30:16something you import and you use.
  • 30:18And so I think it's much more important to
  • 30:21kind of build intuition around topology.
  • 30:24So what we do in in in topology is we
  • 30:27build something called simplicial complexes.
  • 30:31And there's a number of different kinds of
  • 30:34simplicial complexes that one can build.
  • 30:36But I'm going to talk about the viatorius
  • 30:39ribs simplicial complex to begin with today.
  • 30:42And so to create a Viatorius ribs
  • 30:45simplicial complex from your data,
  • 30:47what you do is you start with a given
  • 30:51data point and you imagine a disk of some
  • 30:55radius epsilon around that data point.
  • 30:58And you do this for every other
  • 31:00data point in the data set.
  • 31:02And you're going to grow this
  • 31:05epsilon radius disk over time.
  • 31:08And what you're going to do is when
  • 31:11two epsilon radius disks intersect
  • 31:13with each other, so they overlap,
  • 31:16you're going to draw an edge between
  • 31:19those two data points creating A1 simplex.
  • 31:23When you have three points shown
  • 31:26here as AB and C,
  • 31:28and they their epsilon discs all
  • 31:31intersect in a pair wise manner,
  • 31:34then we're going to draw a filled
  • 31:36in triangle which we are going to
  • 31:39call A2 simplex.
  • 31:40And then in higher dimensions,
  • 31:42when we have four data points all
  • 31:44intersecting in a pair wise manner,
  • 31:47then we're going to draw a three simplex.
  • 31:50So we are going to take our data set.
  • 31:52In this case the data set happens to
  • 31:55be two-dimensional and we are we are
  • 31:58constructing these simplices from our
  • 32:00data by expanding these epsilon radius
  • 32:03discs around each point in the data set.
  • 32:06And so in this visualization here,
  • 32:09I'm simply showing you the 0
  • 32:11simplex which which are the data
  • 32:12points that we started from,
  • 32:14and the one simplex which are all
  • 32:17the edges that get created as we are
  • 32:20expanding this epsilon radius disk.
  • 32:22I'm not showing you the disk and I'm
  • 32:25not showing you the field in triangles
  • 32:27or the tetrahedrons simply because the
  • 32:30the figure gets very, very crowded.
  • 32:34So why do we want to construct this
  • 32:38via Torres Ribs complex?
  • 32:40Well,
  • 32:41it turns out if you have some data shown
  • 32:44here as these red dots that are sampled,
  • 32:49this is your experimental data,
  • 32:51and you imagine that this
  • 32:53data is coming from some
  • 32:55kind of underlying manifold.
  • 32:57So there's a recognition here that
  • 32:59whatever data we sample comes from
  • 33:01a manifold that has maybe two holes
  • 33:03in the middle of it, it turns out,
  • 33:06and there's a theorem to prove this,
  • 33:08although we're not going to go through
  • 33:10the proof. The theorem says that if your
  • 33:13data is well sampled, so all these X,
  • 33:16the points in X are sampled
  • 33:18throughout the manifold quite well,
  • 33:20then when you construct the
  • 33:23Vietoris-Rips complex from this
  • 33:27data set for some radius epsilon,
  • 33:30then this Vietoris-Rips complex is basically
  • 33:33equivalent to the underlying manifold.
  • 33:36So in kind of more intuitive terms,
  • 33:40what this theorem is saying is
  • 33:42that if you want to learn the
  • 33:45shape of your manifold where the
  • 33:47data is being sampled from,
  • 33:49it is sufficient to construct a Vietoris-Rips
  • 33:53complex at some radius epsilon.
  • 33:56And you will be able to find the manifold
  • 33:59underneath the data and you'll be able
  • 34:01to recognize the fact that your data
  • 34:03is forming this one connected object
  • 34:06which has two holes punched into it.
  • 34:10OK, so let's get back to our example,
  • 34:13the concentric circles example
  • 34:15and the half moons example.
  • 34:17And so here I'm showing you those
  • 34:20epsilon radius discs around the data.
  • 34:23So we have epsilon equals
  • 34:260.05 at the beginning,
  • 34:28we increase our epsilon value.
  • 34:31And as we increase the epsilon value,
  • 34:33these discs that I'm plotting in grade,
  • 34:35they get bigger and bigger until
  • 34:38they cover the whole space.
  • 34:40And so what you can recognize here
  • 34:43is that when our disk is quite small,
  • 34:46even at epsilon equals 0.05,
  • 34:49all the little points that
  • 34:51are in the inner circle,
  • 34:53they all get connected together
  • 34:55because all of those disks are
  • 34:58overlapping with each other.
  • 35:00Then when we increase our epsilon to 0.15,
  • 35:03the inner circles are still
  • 35:06all connected together,
  • 35:07but now the outer circle as well:
  • 35:11the points in the outer circle
  • 35:13are also connected together.
  • 35:14So we observe 2 loops in our data.
  • 35:18As epsilon increases even more,
  • 35:21these loops get closed in and they
  • 35:24merge with each other until at the
  • 35:27end when epsilon is really big,
  • 35:29all of the disks intersect with
  • 35:32each other and everything collapses
  • 35:35into just one connected component.
  • 35:39In the two half moons data set,
  • 35:42what we see is that there is a value
  • 35:44of epsilon indeed where there is a
  • 35:46small circle that forms as these
  • 35:49points all get connected together
  • 35:50in a pairwise manner.
  • 35:52But that little circle quickly
  • 35:55disappears when epsilon increases
  • 35:57further and these two arcs get
  • 36:00connected together into one whole.
  • 36:02And so this technique
  • 36:06is called persistent homology.
  • 36:09And what it gives us is what we
  • 36:12call a topological barcode.
  • 36:15So there is code out there that will
  • 36:18take these data points as an input,
  • 36:21doesn't have to be two-dimensional
  • 36:22or three-dimensional,
  • 36:23could be high dimensional data
  • 36:25and it will perform this kind of
  • 36:28computation and give you back
  • 36:29a visual that looks like this,
  • 36:32which is the topological barcode.
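For reference, here is one hedged sketch of how that computation might look in Python, assuming the ripser.py and persim packages are installed (other TDA libraries such as GUDHI or giotto-tda would work equally well):

```python
from ripser import ripser
from persim import plot_diagrams
from sklearn.datasets import make_circles

# point cloud resembling the concentric circles example
X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=0)

result = ripser(X)     # Vietoris-Rips persistent homology, H0 and H1 by default
dgms = result['dgms']  # dgms[0] holds the H0 intervals, dgms[1] the H1 intervals
plot_diagrams(dgms, show=True)
```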
  • 36:34So let's kind of go through the
  • 36:36topological barcode and learn
  • 36:38how to interpret the barcode.
  • 36:40The barcode consists of two parts.
  • 36:42The top half I'm going to call H sub
  • 36:45zero for dimension 0 homology and the
  • 36:48lower part I'm going to call H
  • 36:51sub one for dimension 1 homology.
  • 36:53And so in dimension 0 homology,
  • 36:56we are measuring connectedness of
  • 36:58our data, and we generally call this
  • 37:02the number of connected components.
  • 37:03And so what you can see here is
  • 37:07that when epsilon is close to 0,
  • 37:09where my cursor is,
  • 37:11we see lots and lots of bars in our data set.
  • 37:14And these bars correspond to how
  • 37:17many connected components there
  • 37:19are in our data set.
  • 37:21So when epsilon is 0,
  • 37:22all the points are sitting by themselves,
  • 37:25none of the points are
  • 37:27connected to each other.
  • 37:28So we get as many bars as the
  • 37:31number of points in our data.
  • 37:33As epsilon starts increasing,
  • 37:35we start merging together points
  • 37:38by connecting them with an
  • 37:41edge and forming a 1-simplex.
  • 37:43So as epsilon is increasing here
  • 37:45you can see that the number of
  • 37:48bars is fewer and fewer until at
  • 37:51high values of epsilon we end up
  • 37:53with just one bar in our barcode.
  • 37:57So this dimension 0 homology,
  • 38:00this is capturing the connectivity of
  • 38:02our data and by looking at the slope
  • 38:06by which these bars are decreasing in number,
  • 38:09we can figure out how connected
  • 38:11our data set really is.
  • 38:14In dimension 1,
  • 38:15which is at the bottom of this barcode,
  • 38:19what we are measuring is the
  • 38:21presence of loops in our data set.
  • 38:23So at epsilon equal to 0 on the very left,
  • 38:26we have no bars in the lower
  • 38:28part of this diagram,
  • 38:30which means there are no loops
  • 38:32present at that value of epsilon.
  • 38:35At a later value of epsilon we
  • 38:38see the occurrence of this first
  • 38:40loop from these orange points
  • 38:42in the inner concentric circle.
  • 38:44That loop persists for a long period of time.
  • 38:49What I mean by time is it persists
  • 38:51for a large range of epsilon values.
  • 38:55During this process,
  • 38:56there is a second loop that forms,
  • 38:59indicated by the second red bar.
  • 39:01Here it emerges at a higher
  • 39:04value of epsilon,
  • 39:06and this outer loop dies sooner
  • 39:08than the inner loop does.
  • 39:10The inner loop persists for even longer.
  • 39:13So by looking at the bars in our bar code,
  • 39:17we can learn that our data has so
  • 39:20many points simply by counting the
  • 39:23number of bars at epsilon equal to 0.
  • 39:26We can learn how connected our data
  • 39:29set is by looking at how these bars
  • 39:32disappear as epsilon increases.
  • 39:34And then in the lower part of the bar code,
  • 39:37by looking at these bars,
  • 39:40we can learn how many loops
  • 39:42are present in our data.
  • 39:43In particular,
  • 39:44the bars that are longer in length
  • 39:47actually represent actual loops
  • 39:49that are present in our data.
  • 39:51There are indeed some smaller
  • 39:54bars which are small noisy
  • 39:56loops that form as we perform this procedure.
  • 40:00And what's apparent from these
  • 40:02two barcodes is that in our first
  • 40:05example with the concentric circles,
  • 40:07there are two clear loops in that data.
  • 40:11And in our second example, there is
  • 40:13indeed a small loop that emerges here,
  • 40:16but it quickly disappears,
  • 40:18so there are really no topologically
  • 40:21significant loops present
  • 40:22in the second data set.
  • 40:24And so these bar codes capture
  • 40:26the shape of our data.
  • 40:29You can continue to plot H2 and H3
  • 40:33which are going to capture higher
  • 40:35dimensional holes in your data.
  • 40:37So H2 is going to capture 3-dimensional
  • 40:41holes or voids in the data.
  • 40:43H3 will capture even higher-
  • 40:45dimensional holes in the data.
  • 40:46So topology captures the shape of our
  • 40:49data by measuring connectedness and
  • 40:50the presence of loops in the data.
  • 40:53Are there any questions?
  • 40:55Yes, Jay, just to kind of translate
  • 40:57math into more intuition.
  • 40:59When you say you have holes or loops,
  • 41:02you're pretty much talking
  • 41:04about some impossible states,
  • 41:06meaning that your state cannot have this,
  • 41:09like cannot be in the specific
  • 41:11state for whatever reasons, right?
  • 41:13Yeah, that's a great question.
  • 41:15So I'm talking indeed about
  • 41:17impossible states because these
  • 41:19points are derived from experiments
  • 41:21and they represent the state of our
  • 41:24brain or the state of our tissue.
  • 41:26And therefore if we have a hole in
  • 41:29our data set, that means there's no
  • 41:31data points present in the middle.
  • 41:33And that means that there is that
  • 41:35state is impossible as far as we can
  • 41:38tell from our experimental data.
  • 41:39So that's first conclusion.
  • 41:42The 2nd conclusion which we can get is this.
  • 41:46H1 measures kind of holes in two dimensions.
  • 41:50And so that necessarily means that there
  • 41:54is data that surrounds the hole, right?
  • 41:56There must be some surrounding data and
  • 41:58whenever there is data that surrounds a hole,
  • 42:01that might indicate some kind
  • 42:03of periodicity in the data set.
  • 42:06So you can imagine that if you have data
  • 42:08points that are arranged in a circle,
  • 42:10doesn't have to be a perfect circle, it
  • 42:12could be like an elliptical or skewed circle.
  • 42:15This technique still works.
  • 42:16But that tells you that
  • 42:19there is some sort of process.
  • 42:21Yeah, there's a process that goes
  • 42:23around in in a kind of periodic way.
  • 42:26So you can navigate those that state
  • 42:28space in a way that's periodic or
  • 42:31almost periodic or quasi periodic.
  • 42:34So impossible states,
  • 42:35as well as periodic states, are being
  • 42:38captured through dimension 1 homology
  • 42:40in this technique,
  • 42:42indeed. I
  • 42:44also have one question.
  • 42:45So when we are increasing epsilon,
  • 42:48yeah, are there some loops
  • 42:52that are disappearing?
  • 42:54Because if we increase epsilon,
  • 42:56loops should not
  • 42:58disappear, right? Loops
  • 43:00can disappear. So the way
  • 43:02this outer loop is disappearing
  • 43:04here is when there is a value of
  • 43:07epsilon when one of the disks from
  • 43:10the outer loop intersects with
  • 43:12the disk from the inner loop.
  • 43:14As soon as these two discs
  • 43:16start intersecting,
  • 43:17we draw an edge that goes from
  • 43:19a point in the inner loop to a
  • 43:21point on the outer loop and that
  • 43:24effectively connects those two
  • 43:26loops together, and the enclosing
  • 43:28space between the inner loop and the
  • 43:31outer loop disappears at that stage.
  • 43:33To get rid of the inner loop you have
  • 43:36to increase epsilon a lot higher
  • 43:38because you have two points that are
  • 43:41opposite to each other in the inner loop,
  • 43:44and when those two points
  • 43:46get connected to each other,
  • 43:47the inner loop closes in and disappears.
  • 43:50So yes, loops can disappear,
  • 43:53and in fact at a high enough value
  • 43:56of epsilon, all loops will close up,
  • 43:59and when epsilon is infinity,
  • 44:01all the points are necessarily
  • 44:03intersecting with each other and there
  • 44:05are no loops present in the data.
  • 44:07Maybe this is a little bit more apparent
  • 44:09in this previous animation when I was
  • 44:12drawing the one simplex where you
  • 44:14can see that at a higher value of epsilon,
  • 44:16we do indeed end up closing the outer
  • 44:19loop by connecting it to the inner loop.
  • 44:22So you see that there's a bridge that forms
  • 44:24from the outer loop to the inner loop,
  • 44:25and that empty space here closes in
  • 44:28and at a very high value of epsilon,
  • 44:30there will be edges that
  • 44:32go across the inner loop,
  • 44:34closing the inner loop entirely.
  • 44:36Although that doesn't happen in
  • 44:38this animation because I didn't
  • 44:40increase epsilon high enough.
  • 44:43I suppose the question...
  • 44:45There is a question in the chat.
  • 44:48And Zachary, would you like
  • 44:49to ask your question live or
  • 44:51would you like me to read that?
  • 44:53I think I can read it.
  • 44:54So the question says earlier you
  • 44:57mentioned sufficient sampling of points
  • 44:59as necessary to interpret a manifold.
  • 45:01What you're showing now seems to
  • 45:03address properties of the manifold.
  • 45:04Is there a property to address how much
  • 45:07variability is or isn't accounted for by
  • 45:10this manifold characterization process,
  • 45:12possibly due to under sampling?
  • 45:14That's a great question.
  • 45:16So what I'm presenting at the
  • 45:19moment is under the assumption
  • 45:22that our manifold is well sampled.
  • 45:24So indeed, if your experimental
  • 45:28procedure failed to sample a
  • 45:30data point in the middle here,
  • 45:32my conclusion would be that your
  • 45:34data set has cellular states
  • 45:36organized into two circles,
  • 45:39and they're kind of
  • 45:40independent of each other.
  • 45:41There is indeed an outer circle,
  • 45:43and that might be completely wrong
  • 45:45simply because we never sampled data
  • 45:48that exists between these two circles.
  • 45:50So I am indeed operating under the
  • 45:53assumption that the manifold is well sampled.
  • 45:57Another aspect of the question was
  • 46:00what kind of properties of
  • 46:03the manifold am I really capturing?
  • 46:06And so I'm capturing topological
  • 46:08properties of the manifold.
  • 46:10And by topological properties I
  • 46:12mean how connected that manifold
  • 46:14is and whether or not there are
  • 46:16holes in that manifold.
  • 46:18And so this technique is like invariant
  • 46:20to things like translation of the data.
  • 46:23So if I take these circles and
  • 46:25I translate them somewhere else,
  • 46:27that doesn't impact it.
  • 46:29If I take this diagram and I
  • 46:31rotate it by 45° or 30°,
  • 46:33that's not going to change
  • 46:35the bar code at all.
  • 46:36So it's translation invariant,
  • 46:38it's rotationally invariant.
  • 46:40And it's also invariant to certain
  • 46:43kinds of deformations where if I take
  • 46:46this arc and I deform it a little bit,
  • 46:49that's not going to change my barcode
  • 46:51and it's not going to change the
  • 46:54fact that the data is connected
  • 46:56in this crescent moon shape.
  • 46:58We'll get into more of the details
  • 47:00of what aspects of the manifold we
  • 47:03are capturing as we progress further.
  • 47:04But I hope that kind of goes a little
  • 47:07way towards answering your question.
  • 47:11OK, so we have introduced this
  • 47:14idea of a topological barcode.
  • 47:16Next I want to show you a more convenient
  • 47:20way of representing this barcode
  • 47:22which is called a persistence diagram.
  • 47:26And a persistence diagram
  • 47:28is very easy to construct.
  • 47:29What you do is you draw two axes:
  • 47:33the X axis is called the
  • 47:35birth axis and the Y axis is
  • 47:37called the death axis generally.
  • 47:39And these axes are going to represent
  • 47:43when bars start in the barcode.
  • 47:45When do the bars start in
  • 47:48the barcode and where do they end?
  • 47:50And so you cannot complete or you cannot
  • 47:53end a bar before starting it.
  • 47:57And because of that,
  • 47:58all the points in this persistence diagram
  • 48:01are going to happen above the diagonal.
  • 48:04And there's two kinds of points here.
  • 48:06I hope you can see that there's
  • 48:08points that are represented as tiny
  • 48:10circles and there are points that
  • 48:11are represented as tiny diamonds,
  • 48:13and so the circles are coming out
  • 48:16of the H0, dimension 0 homology,
  • 48:19representing the connectedness of the data.
  • 48:21They are all present here on the
  • 48:24left side of the persistence diagram
  • 48:26because all of these bars in H zero
  • 48:29start at epsilon zero and then they end
  • 48:32at some positive value of epsilon.
  • 48:34Therefore,
  • 48:35all of these points here represent
  • 48:38H 0 and the points that are over
  • 48:41here further away from zero,
  • 48:44these are all representing H1.
  • 48:47So I'm simply taking the starting
  • 48:49coordinate of the bar and the
  • 48:51ending coordinate of the bar,
  • 48:52and I'm just representing it
  • 48:54along these two axes.
  • 48:56And this is what's called
  • 48:58a persistence diagram.
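Since the construction is just a scatter plot of (birth, death) pairs, a minimal matplotlib sketch (with made-up intervals for illustration) might look like this:

```python
import matplotlib.pyplot as plt

# placeholder (birth, death) intervals read off a barcode
h0 = [(0.0, 0.08), (0.0, 0.12), (0.0, 1.5)]  # H0: connected components
h1 = [(0.15, 1.1), (0.30, 0.45)]             # H1: loops

fig, ax = plt.subplots()
ax.scatter(*zip(*h0), marker='o', label='H0')
ax.scatter(*zip(*h1), marker='D', label='H1')
ax.plot([0, 1.6], [0, 1.6], 'k--')           # diagonal: birth = death
ax.set_xlabel('birth')
ax.set_ylabel('death')
ax.legend()
plt.show()
```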
  • 49:00The conventional wisdom in persistent
  • 49:03homology and topological data
  • 49:05analysis is that points that are
  • 49:09further away from the diagonal,
  • 49:12they correspond to longer bars
  • 49:14in the bar code,
  • 49:16and those are the topologically more
  • 49:19significant features in our data.
  • 49:21So in this case,
  • 49:23for the concentric circles example,
  • 49:24we see two red diamonds that
  • 49:26are far away from the diagonal,
  • 49:29indicating the presence of two loops
  • 49:32or two circles in our data set.
  • 49:36This is the first,
  • 49:38the topological barcode for the
  • 49:39Half Moons data set,
  • 49:41and this is the corresponding
  • 49:43persistence diagram.
  • 49:44You'll notice that this highlighted
  • 49:46red diamond corresponding to that
  • 49:49tiny loop that emerged for a bit
  • 49:51is actually quite close to the
  • 49:53diagonal in this persistence diagram,
  • 49:55indicating that it's not very significant.
  • 49:58Now,
  • 49:59there are ways to compute statistics
  • 50:01here and to figure out more
  • 50:04quantitatively whether or not a given
  • 50:07topological feature is significant,
  • 50:09but I'm not going to get into that today.
  • 50:11There are bootstrapping methods
  • 50:13that give you a confidence
  • 50:15interval around the diagonal,
  • 50:18and anything that falls inside of
  • 50:20that confidence interval is going
  • 50:23to be insignificant features.
  • 50:25And anything that falls outside
  • 50:27of that confidence interval is
  • 50:29further away from the diagonal,
  • 50:30and it's going to be topologically
  • 50:33significant features.
  • 50:34So those kinds of tools exist,
  • 50:36but I'm not getting into that today,
  • 50:38just to build intuition.
  • 50:42OK, so here's a small quiz that I like
  • 50:44to do when I maybe present this also to
  • 50:48undergraduates where I have three data
  • 50:51sets and three persistence diagrams,
  • 50:52but I've kind of jumbled them all up.
  • 50:55And so let me just quickly tell you what the
  • 50:57data sets are and what the diagrams are.
  • 50:59I'll give you a moment to think,
  • 51:01I'll drink some water,
  • 51:02and then we'll go over the solution.
  • 51:04So the first data set here,
  • 51:07it's very hard to tell,
  • 51:08but these are two spheres, where I
  • 51:11have data sampled from an inner sphere and
  • 51:15data sampled from the outer sphere here.
  • 51:19So it's in three dimensions.
  • 51:21The second example are three circles where
  • 51:23I have two circles that are concentric
  • 51:26with one another and a separate circle
  • 51:29that's outside of these two circles.
  • 51:32And the third one is really hard
  • 51:33to tell in this visualization,
  • 51:35but this is data that's samples
  • 51:37from the surface of a doughnut
  • 51:40or a Taurus in mathematics.
  • 51:42So this is again A3 dimensional object.
  • 51:45It in the lower half,
  • 51:46I'm showing you persistence diagrams where
  • 51:49the black dots are representing dimension
  • 51:520 homology and that's connected components.
  • 51:56The red triangles are representing dimension
  • 52:001 homology which are loops in our data.
  • 52:04And now we have blue diamonds,
  • 52:07and the blue diamonds are
  • 52:09representing H2 dimension 2 homology,
  • 52:12which are three-dimensional
  • 52:14holes or voids in our data.
  • 52:17And so I'd like you to think about
  • 52:20matching these data sets to their
  • 52:24corresponding persistence diagrams.
  • 52:25A hint would be to look at
  • 52:28the blue diamonds first,
  • 52:29because blue diamonds are
  • 52:31indicating 3D empty space.
  • 52:33I'm going to take a quick drink of water
  • 52:35and then we'll go over the solution.
  • 52:49OK, So hopefully folks have realized
  • 52:54that this persistence diagram on the left
  • 52:59doesn't have any blue diamonds in it,
  • 53:02doesn't have any 3D empty space in it,
  • 53:05and therefore it corresponds to the second
  • 53:08data set of the three concentric circles.
  • 53:12You can see that there is 2 red triangles
  • 53:14that are further away from the diagonal here,
  • 53:17and there's one red triangle that's a
  • 53:19little bit away from the diagonal here.
  • 53:22And those three, those 3 triangles
  • 53:25correspond to these three loops.
  • 53:28One of the triangles is quite close
  • 53:30to the diagonal because of the
  • 53:31fact that these are concentric,
  • 53:33so you can kind of bridge across
  • 53:35them quite easily.
  • 53:36Now we have 2 persistence diagrams
  • 53:39that have diamonds in them,
  • 53:41and to figure out which one is which,
  • 53:44I think you have to look at some
  • 53:47of these triangles again.
  • 53:49And so what distinguishes the right one
  • 53:51from the left one is, on the right one,
  • 53:53I'm not really seeing any triangles
  • 53:55that are very far from the diagonal.
  • 53:58But in this one,
  • 53:59I see one triangle here that's
  • 54:01far from the diagonal.
  • 54:03And maybe there's another triangle
  • 54:04here that's kind of separated from all
  • 54:07the noise over here that's slightly
  • 54:09further away from the diagonal.
  • 54:11And so the way you can get these two
  • 54:14triangles is because when you have a torus,
  • 54:17if you think about a torus as a doughnut,
  • 54:20you have a circle that goes
  • 54:24across the torus,
  • 54:25like a horizontal circle
  • 54:27going across the doughnut.
  • 54:29And then you have another loop,
  • 54:31another circle that goes kind of
  • 54:34perpendicular to the first circle and
  • 54:36goes around the doughnut in this way.
  • 54:39So then in a doughnut,
  • 54:40there are two loops and there's
  • 54:43one empty space or one 3D hole.
  • 54:46Whereas in the concentric spheres example,
  • 54:49you just have the empty space between the
  • 54:53two spheres that's shown by this diamond.
  • 54:57So sorry,
  • 54:58there are two empty spaces.
  • 54:59There's empty space inside the inner sphere,
  • 55:02and there's empty space between the
  • 55:05outer sphere and the inner sphere.
  • 55:07And that shows up here because you
  • 55:10have one blue diamond up here,
  • 55:12and then maybe you have one blue
  • 55:14diamond here that's slightly
  • 55:15further away from the diagonal.
  • 55:17And so those correspond to the space
  • 55:19inside the inner sphere and the
  • 55:22interstitial space between the two spheres.
  • 55:25So hopefully that helps you build
  • 55:27some intuition.
  • 55:27This is a lot easier to do when
  • 55:29I put confidence intervals,
  • 55:31but if I do put confidence intervals,
  • 55:33we'll have to compute them separately for
  • 55:36dimension 0, dimension 1, and dimension 2.
  • 55:39And so that makes things harder.
  • 55:40Also,
  • 55:41I don't really have enough space
  • 55:42to to draw 3 persistence diagrams
  • 55:45with three confidence intervals,
  • 55:47but hopefully that makes sense.
  • 55:48If you have any question about how to
  • 55:51interpret these persistence diagrams
  • 55:52or like the solution to this little quiz,
  • 55:55please feel free to chime in.
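To build the same intuition hands-on, here is a minimal sketch, assuming the ripser.py library; the torus radii, sample size, and seed are arbitrary illustration choices, not from the talk:

```python
import numpy as np
from ripser import ripser

# Sample 400 points from a torus: u runs around the tube, v around the hole.
rng = np.random.default_rng(0)
u, v = rng.uniform(0, 2 * np.pi, (2, 400))
R, r = 2.0, 1.0
X = np.c_[(R + r * np.cos(u)) * np.cos(v),
          (R + r * np.cos(u)) * np.sin(v),
          r * np.sin(u)]

# Persistence diagrams up to dimension 2: for a torus you should see
# two prominent H1 points (the two loops) and one H2 point (the void).
dgms = ripser(X, maxdim=2)['dgms']
```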
  • 55:59OK, so now at this point,
• 56:03we have figured out how to take
  • 56:05our point cloud data set and convert
  • 56:08it into a topological barcode,
  • 56:11which we can represent
  • 56:13as a persistence diagram.
  • 56:14So the next thing we want to do is we
  • 56:17want to compare two different data sets.
  • 56:20And so comparing two different
  • 56:22point clouds can be quite tricky.
  • 56:24You don't really know like how to
  • 56:27distinguish a torus from a sphere
  • 56:30from some other blobby thing.
• 56:32And so one way in which you can
• 56:35compare these kinds of data, two
• 56:37different point clouds, is by instead
• 56:40comparing their persistence diagrams.
  • 56:43And so there are multiple
  • 56:46techniques to compute distances
  • 56:48between persistence diagrams.
  • 56:51And one of those techniques is what's
  • 56:53called the bottleneck distance.
• 56:55And so what happens in the bottleneck
• 56:58distance is that you pair points up.
  • 57:01And again,
  • 57:02I should apologize here because I'm
  • 57:04using color slightly differently here.
• 57:06So the blue-colored dots are
• 57:09from the first persistence diagram,
• 57:11diagram 1, and the red
• 57:13squares are from diagram 2.
  • 57:16And so we want to compare
  • 57:17diagram 1 to diagram 2.
  • 57:19And the way we do that is by first
  • 57:22matching features in diagram
  • 57:231 to features in diagram 2,
• 57:26where we also allow ourselves to
  • 57:29map certain features to the diagonal itself.
  • 57:32So that's a matching process that happens.
  • 57:35And once you have matched the features
  • 57:38to each other or to the diagonal,
  • 57:40you find the two paired features that
  • 57:43are furthest away from each other and
  • 57:46you compute this distance between them.
  • 57:48This is called the bottleneck distance,
• 57:51and there are ways to represent
• 57:53that mathematically,
• 57:54but here that's not so important.
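As a concrete sketch, assuming the persim library (the two toy diagrams are made up for illustration):

```python
import numpy as np
import persim

# Two toy H1 persistence diagrams; each row is (birth, death).
dgm1 = np.array([[0.2, 1.4], [0.5, 0.7]])
dgm2 = np.array([[0.3, 1.2], [0.6, 0.65]])

# Bottleneck distance: match features across diagrams (or to the
# diagonal) and report the largest distance among matched pairs.
d_b = persim.bottleneck(dgm1, dgm2)
print(f"bottleneck distance: {d_b:.3f}")
```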
• 57:56The intuition is probably
• 57:58what's most important.
• 57:59And there is a very important theorem
• 58:02in the field that guarantees stability;
• 58:05it's called the stability theorem.
  • 58:08And what it says is that if I have a
  • 58:11point cloud X and a point cloud Y,
  • 58:14if my point cloud X is just a slight
• 58:16perturbation of point cloud Y,
• 58:18so I've just moved the points
• 58:20around a little bit.
• 58:21Then the bottleneck distance between
• 58:24the persistence diagrams computed
• 58:26through the Vietoris-Rips complex of
• 58:28X and the Vietoris-Rips complex of Y.
  • 58:31This bottleneck distance is
  • 58:33guaranteed to be small because if
  • 58:36X is slightly different from Y,
  • 58:39the right hand side of this
  • 58:41equation is going to be close to 0,
  • 58:42and therefore the bottleneck distance
  • 58:44is going to be very very small.
  • 58:47So this basically guarantees the fact
  • 58:49that if you have one point cloud,
  • 58:51maybe points arranged as a circle
  • 58:53and you tweak the point slightly,
  • 58:55so you add a little bit of noise
  • 58:56to those points,
  • 58:57then the bottleneck distance is
  • 58:59not going to change much.
  • 59:01So it means that this bottleneck distance
  • 59:04is a stable way of comparing point clouds.
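For reference, one common statement of the stability theorem for point clouds, written from memory so treat the constant as indicative, is

$$ d_B\big(\mathrm{Dgm}\,\mathrm{Rips}(X),\ \mathrm{Dgm}\,\mathrm{Rips}(Y)\big) \;\le\; 2\, d_{GH}(X, Y), $$

where $d_{GH}$ is the Gromov-Hausdorff distance between the point clouds X and Y. If X is a slight perturbation of Y, the right-hand side is close to 0, so the bottleneck distance between the two diagrams is forced to be small.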
• 59:07Again, I don't care so much about the math.
• 59:09The main result is that topology
• 59:13is robust to these kinds of
• 59:18noise and perturbations in
• 59:20our data. Now one of the problems
• 59:22with the bottleneck distance is that
  • 59:25we are doing all this matching,
  • 59:26but ultimately we are only really
  • 59:28looking at the distance between points
  • 59:30that are matched but are furthest away
  • 59:33from each other and we're ignoring
  • 59:35all the other points that got matched.
• 59:37So maybe you want to be more sensitive
• 59:40to how well the matching works,
• 59:43and the way to actually use that
• 59:46information is to compute what's
• 59:48called the Wasserstein distance,
• 59:50where you perform the matching process first.
• 59:54And then you compute this Wasserstein
• 59:57distance between diagram 1 and diagram 2,
• 59:59where you sum over the distances between
• 01:00:02all the pairs of points that are matched to
• 01:00:05each other, with points matched to the
• 01:00:07diagonal contributing their distance to it.
• 01:00:08So this is an even more informative
• 01:00:12way of comparing two persistence diagrams.
• 01:00:16And the Wasserstein distance also has
• 01:00:18stability properties similar to the ones
• 01:00:21I was talking about previously.
• 01:00:24If you have any kind of experience
• 01:00:27with like optimal transport theory
• 01:00:29or like statistics,
• 01:00:30then I just wanted to highlight that
• 01:00:33the Wasserstein distance that we're
• 01:00:35talking about here is similar to,
• 01:00:37it's actually exactly the same as
• 01:00:39the Wasserstein distance that you
• 01:00:40would be familiar with, in the sense
• 01:00:42that you have a transport map and
• 01:00:44you're kind of moving mounds of
• 01:00:46earth from one place to another,
• 01:00:48or the Wasserstein distance that you
• 01:00:50use to compare two probability distributions.
• 01:00:52You can think about these topological
• 01:00:54features as like probability distributions
• 01:00:56and you're learning to map one to the other.
  • 01:00:58So this is just an aside for folks who might
  • 01:01:02be more familiar with optimal transport.
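A minimal sketch of the Wasserstein computation, again assuming the persim library, with made-up toy diagrams:

```python
import numpy as np
import persim

dgm1 = np.array([[0.2, 1.4], [0.5, 0.7]])
dgm2 = np.array([[0.3, 1.2], [0.6, 0.65]])

# Wasserstein distance: sums the costs over the whole matching,
# so it is sensitive to every matched pair, not just the worst one.
d_w = persim.wasserstein(dgm1, dgm2)
print(f"Wasserstein distance: {d_w:.3f}")
```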
  • 01:01:05OK,
• 01:01:07I wanted to put this review
• 01:01:10article in because I have talked
• 01:01:13so far about extracting
• 01:01:16topological features from point cloud
• 01:01:18data sets and then learning how to
  • 01:01:21interpret those topological features
  • 01:01:23and to compare two different data
  • 01:01:26sets by computing the bottleneck
  • 01:01:28distance or the washer steam distance
  • 01:01:31between their topological features.
• 01:01:33But it is also possible
• 01:01:37to use topology in a different way,
  • 01:01:39where you use topology to inform the
  • 01:01:43training of a machine learning architecture.
  • 01:01:47And so for folks who are familiar
  • 01:01:49with machine learning,
  • 01:01:50I just wanted to point out that
  • 01:01:53there are ways in which you can use
  • 01:01:55topology inside the loss function
  • 01:01:57of your neural network.
• 01:01:59So you can have a topology-informed loss,
• 01:02:02and a good example of this is a
• 01:02:05paper by Channel in 2019 that I can point to.
  • 01:02:08You can also in machine learning
  • 01:02:11use topology to compare two
  • 01:02:13different model architectures.
  • 01:02:15One way of doing this would be to
  • 01:02:17have two different like machine
  • 01:02:19learning architectures where you
  • 01:02:20look at the activations in all
  • 01:02:23the layers of those architectures,
  • 01:02:25treat that as a point cloud and compare
• 01:02:27them against each other using the Wasserstein
• 01:02:30distance of their persistence features.
• 01:02:32And a good example of that is
• 01:02:34Zo et al. in 2021.
• 01:02:36And then what I'm talking about,
• 01:02:39which is similar to this paper from 2017,
  • 01:02:41is to actually just take your data and
  • 01:02:44use topology to featurize that data,
  • 01:02:47which is to extract topological
  • 01:02:49features of that data,
  • 01:02:50learn to interpret those
  • 01:02:52topological features,
  • 01:02:53and then perhaps pass them into
  • 01:02:55some machine learning framework
  • 01:02:57to generate some kind of output.
• 01:02:59So there are different places in machine
• 01:03:01learning where one can use topology.
  • 01:03:04In our case,
  • 01:03:05we are going to focus on ways in
  • 01:03:07which we use topological features
  • 01:03:09extracted from our data and pass
  • 01:03:12them into machine learning in order
  • 01:03:14to do some kind of downstream task.
  • 01:03:17OK, so next I wanted to cover a few ways
  • 01:03:21of taking these topological features
  • 01:03:25and converting them into summaries.
  • 01:03:28And the reason for doing that
  • 01:03:30is because we want to use these
  • 01:03:33topological features as input for
  • 01:03:35machine learning down the line.
  • 01:03:37And these diagrams that I've
  • 01:03:39been drawing for you so far,
  • 01:03:41they're easy to draw on a screen,
  • 01:03:43but they're not really great for
  • 01:03:45machine learning because if you
  • 01:03:47have a bunch of different data sets,
• 01:03:49you're going to get a different number of
  • 01:03:52topological features for each data set.
  • 01:03:54And you don't really know a way of
  • 01:03:56like converting this into something
  • 01:03:58that can go into machine learning.
  • 01:03:59So folks have found various ways of
  • 01:04:03taking these persistence diagrams
  • 01:04:04and converting them into even more
• 01:04:08convenient representations that can be
• 01:04:10used for mathematical analysis,
• 01:04:13but more importantly for machine
• 01:04:14learning down the road.
  • 01:04:16And one such representation is
  • 01:04:18called the persistence landscape,
  • 01:04:20where you are taking a diagram like this
  • 01:04:23and converting that into a function.
• 01:04:25And the way you do that
• 01:04:26is quite simple really.
• 01:04:28You take each point and you draw a
• 01:04:31tent function based off of that point
• 01:04:33by connecting it to its X coordinate
• 01:04:36and connecting it to its Y coordinate
• 01:04:39intersected with the diagonal.
  • 01:04:41It takes more words to describe,
• 01:04:43so you can just simply
• 01:04:45see it from this picture.
  • 01:04:47You draw this little tent function
  • 01:04:50and then tilt the diagram by 45° and
  • 01:04:53there you end up getting this function
  • 01:04:57representation of your persistence diagram.
  • 01:05:00You can now treat this as a function,
  • 01:05:02and you can use tools from functional
  • 01:05:06analysis to analyze this persistence diagram.
  • 01:05:09Again,
• 01:05:09you can kind of formalize this with a
• 01:05:12bunch of math by writing out what a tent
• 01:05:15function looks like and how you take
  • 01:05:17a diagram and convert it into a function,
  • 01:05:21but this is all just notation.
  • 01:05:25I simply want to convey to you
  • 01:05:27the intuition behind taking a
  • 01:05:29diagram and converting it into a
  • 01:05:32function for downstream analysis.
  • 01:05:34There's some important reasons why one
  • 01:05:36might want to convert this into a function.
  • 01:05:39One of them is that you can use
  • 01:05:41tools from functional analysis.
  • 01:05:43Another thing that's important is
  • 01:05:45that this is an injective mapping,
  • 01:05:48and it satisfies the same properties
  • 01:05:50that persistence diagrams satisfy.
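A minimal sketch of the landscape construction, assuming the gudhi library (the diagram values are made up):

```python
import numpy as np
from gudhi.representations import Landscape

# One toy diagram of (birth, death) pairs with finite deaths.
dgm = np.array([[0.2, 1.0], [0.4, 0.8]])

# Sample the first 3 landscape (tent) functions on a 100-point grid,
# turning the diagram into a fixed-length vector.
landscape = Landscape(num_landscapes=3, resolution=100)
vec = landscape.fit_transform([dgm])   # shape (1, 3 * 100)
```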
  • 01:05:54Another convenient way of converting
  • 01:05:57persistence diagrams into something
  • 01:05:59that's useful for machine learning is
  • 01:06:02to convert the diagram into an image.
  • 01:06:05The reason you might want to do this is
  • 01:06:07because we have architectures that are very,
  • 01:06:10very good at dealing with images.
  • 01:06:12We know how to classify
  • 01:06:14cats and dogs and horses.
  • 01:06:16We also know how to generate images.
  • 01:06:18So we can take advantage of all the tools
  • 01:06:20we have developed for dealing with images
  • 01:06:23in machine learning if we can convert
  • 01:06:25persistence diagrams into an image.
  • 01:06:27And the way one goes about doing that is
  • 01:06:30you take your input persistence diagram,
  • 01:06:33you tilt it again by 45°.
  • 01:06:36So now you're measuring the birth
  • 01:06:38coordinate and you're measuring
  • 01:06:39distance from the diagonal, which we
  • 01:06:41call persistence on the Y coordinate.
  • 01:06:44So nothing fancy,
  • 01:06:45just kind of tilting the diagram.
  • 01:06:47Then what you do is at each point in
  • 01:06:51the diagram, you drop a Gaussian,
  • 01:06:54so like a 2D Gaussian,
  • 01:06:56and you weigh the Gaussians by
  • 01:06:58distance away from the X axis.
  • 01:07:01So points that are higher
  • 01:07:03up get a brighter Gaussian.
  • 01:07:04The points that are lower down,
  • 01:07:06they get a lower amplitude Gaussian.
  • 01:07:09Again,
  • 01:07:09the rationale for doing that is that
  • 01:07:11points that are further away are points
  • 01:07:14that are more topologically significant.
• 01:07:16Points that are close to the diagonal are
• 01:07:18kind of derived from some noise in our data,
  • 01:07:20and we want to be robust to noise.
  • 01:07:22So it makes sense to weigh things
  • 01:07:24by distance away from the diagonal.
  • 01:07:26This is called a persistence image.
  • 01:07:28This is still a continuous object.
  • 01:07:30And So what you can do then is you
  • 01:07:33can take this surface and you can
  • 01:07:35just divide it into smaller pixels
  • 01:07:38and convert this into an image format.
  • 01:07:41And once you have this in an image format,
• 01:07:43you can use convolutional neural
• 01:07:45networks and other kinds of
• 01:07:48generative AI tools, taking advantage
• 01:07:50of all of those tools to
• 01:07:53work with these persistence images.
  • 01:07:56Again,
  • 01:07:56there's a bunch of math that one
  • 01:07:58can write down to kind of formally
  • 01:08:00describe this process,
  • 01:08:01but I think the visuals do a much better job.
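A minimal sketch of the persistence image transform, assuming the persim library (the pixel size and diagram values are arbitrary):

```python
import numpy as np
from persim import PersistenceImager

dgm = np.array([[0.2, 1.0], [0.4, 0.8], [0.1, 0.15]])

# Tilt to (birth, persistence), drop a persistence-weighted Gaussian
# on each point, then discretize onto a pixel grid.
pimgr = PersistenceImager(pixel_size=0.05)
pimgr.fit([dgm])                  # choose image bounds from the data
img = pimgr.transform([dgm])[0]   # a 2D numpy array, ready for a CNN
```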
  • 01:08:06Finally, you can convert your
  • 01:08:08persistence diagrams into what are
  • 01:08:11called smooth persistence curves.
  • 01:08:13And the way this works is
  • 01:08:16you walk along the diagonal,
  • 01:08:18and as you're walking along the diagonal
  • 01:08:21from the bottom left to the top right,
  • 01:08:23you look at a window.
  • 01:08:25And you construct the
  • 01:08:26window by looking at this,
  • 01:08:28like this rectangular section
  • 01:08:30that's to the top left of
  • 01:08:33wherever you are on the diagonal.
  • 01:08:36And what you do is you compute some
  • 01:08:38kind of statistic of points that
  • 01:08:41exist within this little window.
  • 01:08:42So one simple statistic would be
  • 01:08:45simply counting the number of points
  • 01:08:47that exist within this window.
  • 01:08:49So that allows you to construct a
  • 01:08:52function as you're walking from
  • 01:08:55left to right, construct a curve,
  • 01:08:58a continuous curve,
  • 01:08:59which you can then analyze.
  • 01:09:03I don't have the curve here for some reason,
  • 01:09:06but you can imagine like as
  • 01:09:07you're walking along the diagonal,
  • 01:09:08just counting how many objects
  • 01:09:10exist within this window over time
  • 01:09:12gives you a continuous curve,
• 01:09:14which you can describe mathematically;
• 01:09:15you can prove things about that curve.
  • 01:09:17And one of the reasons why you might
  • 01:09:19want to use this curve is that there
  • 01:09:21are ways to speed this process up a lot
  • 01:09:24because these are all like Gaussians.
  • 01:09:26And so there's ways to compute
  • 01:09:28this curve very, very fast.
  • 01:09:30So that's also helpful for
  • 01:09:33machine learning purposes.
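The point-counting version described above is usually called a Betti curve; here is a minimal sketch in plain numpy (my own illustration, with made-up values):

```python
import numpy as np

def betti_curve(dgm, thresholds):
    # A feature sits in the window at position t on the diagonal
    # exactly when it is alive there: birth <= t < death.
    return np.array([np.sum((dgm[:, 0] <= t) & (dgm[:, 1] > t))
                     for t in thresholds])

dgm = np.array([[0.2, 1.0], [0.4, 0.8]])
ts = np.linspace(0.0, 1.2, 50)
curve = betti_curve(dgm, ts)   # a fixed-length vector for machine learning
```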
  • 01:09:35OK, Finally I just wanted to give you
  • 01:09:38some sense of how you do topology for
  • 01:09:42data that is not point cloud based data.
  • 01:09:45What if you have like images
  • 01:09:47that you want to work with?
  • 01:09:50You can compute topology
  • 01:09:52directly from images.
  • 01:09:54So think of an image as nothing
  • 01:09:56but a matrix of values, right?
  • 01:09:58And the values are going to depict
  • 01:10:00how bright a given pixel is.
• 01:10:02So if I have a 1 here,
• 01:10:03that pixel is quite dark,
• 01:10:05and a 5 is brighter.
• 01:10:07Sorry,
• 01:10:08the color here
• 01:10:10doesn't really correspond to the value,
• 01:10:12but 5 would be a brighter pixel,
• 01:10:14and 3 would be slightly dimmer than 5,
• 01:10:16but brighter than 1.
  • 01:10:17So you can think of an image as nothing
  • 01:10:20but a matrix of image intensity values.
• 01:10:22And you can perform the same kind of
• 01:10:26filtration that we did previously by
• 01:10:28expanding that epsilon radius disk,
• 01:10:31but now by going through this matrix of values
• 01:10:34and simply deleting everything that's
• 01:10:37above a value or below a value.
• 01:10:39These are called sublevel set
• 01:10:41or superlevel set filtrations.
• 01:10:43In this example,
• 01:10:45only values that are
• 01:10:48less than or equal to 1 are shown,
• 01:10:50and we spot two holes in our data.
  • 01:10:52So five and three become holes.
  • 01:10:55Then you increase your threshold to three.
  • 01:10:57So now three gets filled in,
  • 01:10:59but there's only one hole in the data set.
• 01:11:01And then as you increase your
• 01:11:03threshold to 5, both
• 01:11:05of those holes get filled in.
  • 01:11:07So when you're working with images,
  • 01:11:09you can construct what are
• 01:11:12called cubical complexes,
  • 01:11:13where you define a threshold
  • 01:11:15value for your image,
  • 01:11:16and by applying that threshold you
  • 01:11:19can count in your pixels how many
  • 01:11:22holes exist and you can quantify the
  • 01:11:25shape of an image in that manner.
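A minimal sketch of a sublevel set filtration on an image, assuming the gudhi library; the toy pixel values mirror the 1/3/5 example above:

```python
import numpy as np
import gudhi

# Toy "image": two bright pixels (5 and 3) surrounded by dark pixels (1).
img = np.array([[1, 1, 1, 1, 1],
                [1, 5, 1, 3, 1],
                [1, 1, 1, 1, 1]], dtype=float)

# Cubical complex with a sublevel set filtration on pixel intensities.
cc = gudhi.CubicalComplex(top_dimensional_cells=img)
cc.persistence()
# Expect two H1 features born around 1, dying at 3 and 5,
# as the two bright pixels get filled in.
holes = cc.persistence_intervals_in_dimension(1)
```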
  • 01:11:27I'll show you how this can be used in a
  • 01:11:30very powerful way in a subsequent workshop.
  • 01:11:33I just want to point to the paper here.
• 01:11:37So there's a NeurIPS paper that came out
• 01:11:39in 2020 by Bastian Rieck and
• 01:11:42other folks in the topology community
• 01:11:45where they took fMRI images,
• 01:11:47which were volumetric fMRI images.
  • 01:11:50So there's lots and lots of data.
• 01:11:52And they performed this cubical
• 01:11:54complex filtration of these
  • 01:11:56volumetric images to construct a
  • 01:11:59sequence of persistence diagrams,
  • 01:12:01which they converted into persistence images.
  • 01:12:04And from these persistence images,
• 01:12:06they were able to use machine
• 01:12:08learning techniques to
  • 01:12:09categorize different
  • 01:12:10brain state trajectories.
  • 01:12:12So this is a very cool paper that
  • 01:12:14combines a lot of the stuff that we
  • 01:12:17talked about of going from images to
  • 01:12:20persistence diagrams to persistence
  • 01:12:21images and those images being used
  • 01:12:24as input for machine learning to
  • 01:12:26classify brain state trajectories.
  • 01:12:29You can also take directly
  • 01:12:31the persistence diagram,
  • 01:12:32compute summaries of the persistence
  • 01:12:34diagram such as persistence
  • 01:12:36landscapes and persistence curves.
  • 01:12:38This is a persistence curve here,
  • 01:12:40and you can use those persistence
  • 01:12:42curves directly to perform regression
  • 01:12:44tasks such as estimating the severity
• 01:12:46of the disease in these fMRI images.
  • 01:12:51Lastly, something that's going to be
  • 01:12:54highly relevant to us going forward
  • 01:12:56is doing TDA on time series data.
  • 01:13:00So here's an example of
  • 01:13:02a time series data set.
  • 01:13:04These are just two sinusoidal
  • 01:13:06curves, F1 and F2.
  • 01:13:08You can see F1 has a higher amplitude,
  • 01:13:11F2 has a smaller amplitude over time.
  • 01:13:15And So what you can do is you
  • 01:13:16can plot them against each other.
  • 01:13:18So you can plot F1 against F2.
• 01:13:21And this is one way of
• 01:13:23taking time series data,
• 01:13:24discretizing in time of course,
• 01:13:26and converting it into
• 01:13:28a point cloud data set.
  • 01:13:30And you can compute topology directly
  • 01:13:32from this point cloud data set.
  • 01:13:34So this works when you have two
  • 01:13:36time series data sets, F1 and F2.
  • 01:13:38You can convert that into a point cloud.
  • 01:13:41If you have just one time series data set,
  • 01:13:44what you can do is you can do a
  • 01:13:46sliding window transformation.
  • 01:13:48So you take a small sliding window,
  • 01:13:50so a small chunk of the data,
  • 01:13:52move that window forward 1 by 1 by 1.
• 01:13:55And within that window,
  • 01:13:57you can construct this phase portrait or
  • 01:14:00this time delay embedding as it's called.
  • 01:14:02And then you can take that loop and you
  • 01:14:05can convert that into a persistence diagram.
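A minimal sketch of the sliding window (time delay) embedding followed by persistent homology, assuming ripser.py; the signal and window parameters are arbitrary illustration choices:

```python
import numpy as np
from ripser import ripser

def sliding_window(x, dim, tau):
    # Each row is the delay vector (x[t], x[t+tau], ..., x[t+(dim-1)*tau]).
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i * tau : i * tau + n] for i in range(dim)], axis=1)

t = np.linspace(0, 20, 600)
x = np.sin(2 * np.pi * t)              # a periodic signal

pc = sliding_window(x, dim=2, tau=10)  # the embedding traces out a loop
dgms = ripser(pc, maxdim=1)['dgms']    # expect one prominent H1 feature
```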
  • 01:14:08So I wanted to show you some
  • 01:14:11examples of kind of doing this,
  • 01:14:13right.
• 01:14:14So this is combining both cubical
• 01:14:17homology and this time delay embedding,
• 01:14:20the sliding window embedding, and using
  • 01:14:23that to compute persistence diagrams.
  • 01:14:27And so again, I'm not going to
  • 01:14:28go through all the details here,
  • 01:14:30but I just wanted to show you an
  • 01:14:32an application of this in practice.
• 01:14:34So this is a paper which was
• 01:14:36on arXiv in 2018.
• 01:14:37It might be out already.
• 01:14:39I hope it's out by now.
• 01:14:40And in this data set,
• 01:14:42they were imaging the vocal
• 01:14:44cords of humans as
• 01:14:46they were making some sounds.
  • 01:14:52And so when you're making like
  • 01:14:54a rhythmic pattern of sounds,
  • 01:14:56your vocal cords, they open, they close,
  • 01:14:58they open and then they close.
  • 01:15:00And so in this data set,
  • 01:15:03obviously there's a periodic nature to
  • 01:15:04the type of sound you're producing.
  • 01:15:07And so if you look at these
  • 01:15:09images of the vocal cords and you
  • 01:15:11compute image self similarity,
  • 01:15:12you can kind of guess that there
  • 01:15:14is a period after which the
  • 01:15:17image becomes similar to itself.
  • 01:15:18And so you can kind of quantify the
  • 01:15:21periodicity of the data in this way.
  • 01:15:23But also,
  • 01:15:23if you take the sequence of images
  • 01:15:26and you do this time delay embedding
• 01:15:29and compute cubical homology,
  • 01:15:30you can end up with a persistence
  • 01:15:33diagram where very clearly you
  • 01:15:35see this H1 feature which tells
  • 01:15:37you there's a loop in your data,
  • 01:15:39which means that the data is periodic.
  • 01:15:42What's cooler is when you do
• 01:15:45what's called biphonation.
• 01:15:46So in biphonation,
  • 01:15:47the vocal cords move in a way
  • 01:15:50that they produce two different
  • 01:15:52frequencies at the same time.
• 01:15:54So you have like a high frequency
• 01:15:57sound coming out,
  • 01:15:58and at the same time you have like
  • 01:16:01a low frequency whine coming out.
  • 01:16:03And I was thinking if I can do a
  • 01:16:05demonstration of this kind of voice,
  • 01:16:06but I really cannot do it.
  • 01:16:08So you have to.
• 01:16:09If you search online for biphonation,
  • 01:16:11you will find examples of people who
  • 01:16:13can produce both high frequencies
  • 01:16:15and low frequencies at the same time.
  • 01:16:17And if you look at the vocal cord
  • 01:16:20images of producing this kind of sound,
  • 01:16:22this is kind of what it looks like
  • 01:16:24when you look at self similarity.
  • 01:16:25You do observe a pattern here.
  • 01:16:28But very importantly,
  • 01:16:29when you take this data and you plug it
  • 01:16:32through the techniques that I've described,
  • 01:16:34you get a persistence
  • 01:16:36diagram that looks like this,
• 01:16:38which has two H1 features and one H2 feature.
• 01:16:42So remember,
• 01:16:44H1 is the dimension 1 hole and
• 01:16:48H2 is the dimension 2 hole or a void.
  • 01:16:51And so from our previous quiz,
• 01:16:53hopefully you recall that two H1 holes and
• 01:16:56one H2 hole means it's like a torus,
• 01:16:59which means there's an empty space in the
• 01:17:02middle and there are two loops in the torus.
  • 01:17:04And that makes perfect sense for
  • 01:17:06this data set because you have a
  • 01:17:09high frequency and a low frequency
  • 01:17:10forming 2 loops here.
• 01:17:12And then, because it's
• 01:17:13arranged like a torus,
• 01:17:14both of those things are happening
• 01:17:16at the same time.
• 01:17:17You get a dimension 2 hole in the
• 01:17:19data set or, as it says in here,
• 01:17:21a two-cycle in the data set.
  • 01:17:25And this again is from the same paper.
  • 01:17:27This is an example where the person is
  • 01:17:31showing irregular vocal fold vibrations.
  • 01:17:34So there is no periodicity,
  • 01:17:36no quasi periodicity.
  • 01:17:38It appears random.
  • 01:17:40When you look at image self similarity,
  • 01:17:42it just goes along the diagonal.
  • 01:17:44You don't see a lot of like important
  • 01:17:47self similarity off the diagonal,
  • 01:17:49which means that all of these images
  • 01:17:51look kind of different from each other.
  • 01:17:53And if you throw this into TDA,
  • 01:17:55you get topological features that
  • 01:17:57are very close to the diagonal.
  • 01:18:00Again,
  • 01:18:01you can compute like confidence
  • 01:18:03intervals and so forth for these things.
  • 01:18:06But again,
  • 01:18:07it shows there's no interesting
  • 01:18:09topology happening in this data
  • 01:18:11set because it's irregular.
  • 01:18:12There's no quasi periodicity
  • 01:18:14or periodicity in this data.
  • 01:18:19Lastly, topology is invertible
  • 01:18:22to a certain extent.
  • 01:18:23So folks often ask like, OK,
  • 01:18:26I created this persistence diagram.
• 01:18:28I want to interpret where these
• 01:18:31topological features come from.
  • 01:18:33And you can do that using something
  • 01:18:36called cycle representatives.
  • 01:18:37And what cycle representatives
  • 01:18:39allow you to do is they allow you to
  • 01:18:42interrogate a specific topological
  • 01:18:44feature and ask the question,
  • 01:18:46where does that feature come
  • 01:18:47from in your input data set?
  • 01:18:50So for example,
  • 01:18:51if we have this persistence diagram that's
  • 01:18:53derived from this point cloud data set,
• 01:18:55you can then interrogate this
• 01:18:57topological feature, and it
• 01:19:00will tell you that this dimension
  • 01:19:020 feature appears here because
  • 01:19:04these two clusters of data became
  • 01:19:07connected at that epsilon value.
• 01:19:09So these two connected
  • 01:19:11components disappeared,
  • 01:19:12they merged together at that epsilon value.
  • 01:19:16And likewise for dimension one,
  • 01:19:18you can interrogate that topological
  • 01:19:20feature and it will tell you
  • 01:19:22that that particular loop is
  • 01:19:24formed by these four points.
  • 01:19:26This is very,
  • 01:19:27very important for us because when we
  • 01:19:29are dealing with the state space of
  • 01:19:32cellular activity and neural activity,
• 01:19:34having access to these cycle
• 01:19:37representatives will give us the ability to
  • 01:19:41say which time points and which
  • 01:19:44parts of the brain precisely led
  • 01:19:46to the formation of a cycle which
  • 01:19:49indicates periodic activity.
  • 01:19:51So we can indeed go back in reverse
  • 01:19:54from topological features back to
  • 01:19:57our original data set and figure
  • 01:19:59out why a certain topological
  • 01:20:02feature exists in our data.
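As a sketch of what this looks like in code: ripser.py exposes representative cocycles, the dual notion, which similarly point back at the input points responsible for a feature (dedicated cycle-representative tools exist in other TDA packages). The example data here is made up:

```python
import numpy as np
from ripser import ripser

# Points on a noisy circle.
rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 100)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((100, 2))

res = ripser(X, maxdim=1, do_cocycles=True)
h1 = res['dgms'][1]
idx = np.argmax(h1[:, 1] - h1[:, 0])   # the most persistent loop
cocycle = res['cocycles'][1][idx]      # rows: (vertex i, vertex j, value)
# The vertex indices identify which input points participate in the loop.
```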
  • 01:20:04And so I think we are at a stage here
  • 01:20:07where we don't have a lot of time left.
  • 01:20:10We're supposed to end at 5:30.
  • 01:20:12So I'm not going to go through the ML parts.
  • 01:20:14I had a few ML slides,
  • 01:20:16but I think we can punt that to
  • 01:20:18our third workshop.
  • 01:20:19I think it would be a good time
  • 01:20:21to take questions and end here.
  • 01:20:23I just wanted to mention if you're,
  • 01:20:24if you're about to leave,
  • 01:20:25we're going to have another workshop
  • 01:20:28next week that will be given by
  • 01:20:30Rahul and he'll be telling you how
  • 01:20:33we can take a graph consisting of
  • 01:20:35nodes and edges and use graph signal
  • 01:20:38processing to quantify how some
  • 01:20:40signal is distributed on that graph.
  • 01:20:43And then the following week,
  • 01:20:45me and Brian,
  • 01:20:46we're going to put all of
  • 01:20:48these things together,
  • 01:20:49talk about GSTH as a technique
  • 01:20:52all combined together and we'll
  • 01:20:54show you how we have used this
  • 01:20:56technique with some of our data sets.
  • 01:20:59Thanks for listening.
  • 01:21:00Thanks for coming.
  • 01:21:01Jay,
  • 01:21:02I have a question.
  • 01:21:03It may be a little bit premature,
• 01:21:05but is my intuition correct?
• 01:21:09Do I understand correctly that
• 01:21:11if you have more organized,
• 01:21:13more complexly organized systems,
  • 01:21:16you should see higher level holes?
  • 01:21:21And if it's mostly random noise,
  • 01:21:24you kind of don't really see much? Yeah.
  • 01:21:27So if you have random noise, then in space,
  • 01:21:30everything will get filled in, right?
  • 01:21:33It's all just noise.
  • 01:21:34So there'll be no structure to the data set
  • 01:21:37and you won't see any holes in the data.
  • 01:21:40The connectivity pattern
  • 01:21:41also looks different.
  • 01:21:42So you can do statistical tests where you
  • 01:21:45can take real data from an experiment and
  • 01:21:49compare that with topological features
  • 01:21:51derived from like standard distributions,
  • 01:21:54like uniform distribution
  • 01:21:55and Gaussian distribution,
  • 01:21:56and it will tell you that in your
  • 01:21:58experiment in the state space,
  • 01:22:00it kind of looks like a uniform distribution.
  • 01:22:02There's no structure to it. Yeah.
  • 01:22:04So that's that's one aspect.
  • 01:22:06Also like dimension zero will tell you
  • 01:22:08if in your trajectory here you have two
  • 01:22:11different connected components, right.
• 01:22:13So if you have like one set
• 01:22:15of states and then a completely
• 01:22:17different set of states and they're
• 01:22:19kind of far apart from each other,
• 01:22:21that's what we learn from
• 01:22:23dimension 0 homology, in addition to,
• 01:22:25you know,
• 01:22:26noise and how the data is distributed.
  • 01:22:28And dimension one will tell us these
  • 01:22:30periodic loop like structures that
  • 01:22:32might exist in our data and also the
  • 01:22:34empty spaces being states that cannot
• 01:22:37really exist based off of this
• 01:22:39experimental data, again being
  • 01:22:41cognizant of the earlier question where
  • 01:22:43if your data is not sampled correctly,
  • 01:22:45it might be telling you the wrong thing.
  • 01:22:49OK, thank you. Do you have any other
  • 01:22:54questions in the chat or if anybody
  • 01:22:56wants to ask any follow up questions?
  • 01:23:01All right, I don't see any questions.
  • 01:23:03Thank you so much.
  • 01:23:05Just a note, all these papers that
  • 01:23:08you mentioned in the presentation
  • 01:23:09I'm going to add to your website.
  • 01:23:12So if people are interested
  • 01:23:14in looking at those papers,
  • 01:23:16there will be links on your maps site.
  • 01:23:21And thank you again.
  • 01:23:22And I'll see you next week. See
  • 01:23:24you next week. See you. Bye.