MAPS_GSTH_part_I
June 18, 2024
Information
Transcript
- 00:00It is my pleasure to introduce
- 00:03Dhananjay Bhaskar.
- 00:06He is a postdoctoral researcher
- 00:08in the Department of Genetics
- 00:10at Yale School of Medicine.
- 00:12He has a strong quantitative
- 00:15background both in mathematical
- 00:18modeling and machine learning,
- 00:20and topological data analysis
- 00:22with applications in biophysics
- 00:24and biomedical research.
- 00:26He received his PhD in Biomedical
- 00:28Engineering and a Master's degree
- 00:31in Data Science from Brown University,
- 00:34and before that he studied computer
- 00:36science and applied mathematics
- 00:38at the University of British Columbia.
- 00:40This is going to be a four-part series and
- 00:44we have a group of presenters who I will
- 00:48introduce for each section separately.
- 00:52And Jay, I'm going to give it to you.
- 00:54So if you want to add anything about
- 00:56your overall research, you're welcome to.
- 00:58And if
- 01:00anybody has questions,
- 01:01you're welcome to type them in the
- 01:04chat or in the Q&A.
- 01:06And Jay said that he will respond
- 01:08to them as they come.
- 01:10And thank you, Helen,
- 01:12for the very kind introduction.
- 01:14And welcome, everyone,
- 01:16to the first workshop in this series.
- 01:19Today, my goal is to introduce you
- 01:23to a methodology called topological
- 01:26data analysis and machine learning.
- 01:29Both of these techniques are,
- 01:32you know, very broad.
- 01:33They encompass a number of different
- 01:36methods and so there's no way I can
- 01:38cover both of them in any amount of
- 01:41detail in just a single session.
- 01:43My goal is to give you a broad
- 01:46overview of these techniques and
- 01:48to build some intuition and for
- 01:52us to share a common vocabulary.
- 01:55And then in the subsequent workshops,
- 01:57we are going to take
- 01:59bits and pieces of topological
- 02:02data analysis, TDA in short,
- 02:04and machine learning that we need
- 02:08to analyze some neuroimaging data.
- 02:11And as Helen mentioned, my name is Dhananjay.
- 02:14I generally go by Jay.
- 02:16I wear many different hats,
- 02:18but most relevant is I'm a
- 02:22postdoctoral fellow in neuroscience.
- 02:24I also work, I wanted to disclose,
- 02:28with Boehringer Ingelheim,
- 02:29which is a German pharmaceutical company,
- 02:32and I still maintain some
- 02:34affiliations with Brown,
- 02:35having completed my PhD there.
- 02:38So first, I wanted to just talk a little bit,
- 02:41give you a little bit of
- 02:43an introduction to myself.
- 02:43So as Helen mentioned,
- 02:45I come from a very quantitative
- 02:48dry lab background.
- 02:50I received my undergraduate degree in
- 02:52computer science and math and did a
- 02:55master's degree in applied mathematics.
- 02:57And in those years, a long,
- 03:00long time ago,
- 03:01I was interested in modelling
- 03:04biophysics and in particular,
- 03:06I was interested in developmental
- 03:08and cancer biology.
- 03:09So I spent a lot of my formative
- 03:12years putting together agent based
- 03:15models to simulate cell migration
- 03:18and cell morphology and emergence of
- 03:21different types of migratory patterns
- 03:24in normal healthy tissue and also
- 03:28various kinds of tumors and you know,
- 03:32across development and embryogenesis.
- 03:35Subsequently,
- 03:36I moved over to Brown University
- 03:40where I was in the data science and
- 03:44biomedical engineering departments.
- 03:45And it was at Brown University
- 03:48where I became fascinated with some
- 03:51mathematical concepts to do with shape.
- 03:54And I realized during these years
- 03:58that learning the shape of the data
- 04:01and being able to quantify the shape
- 04:04of the data can be a really powerful
- 04:08tool for biomedical data analysis.
- 04:11And so for instance,
- 04:12if you have a bunch of data
- 04:15points that are arranged in this
- 04:17weird double-torus-like shape,
- 04:19being able to actually take these
- 04:22individual data points and be
- 04:24able to fill in the empty spaces.
- 04:27And to be able to recognize that
- 04:29there are two big holes in the data
- 04:31and that our data has a loop like
- 04:34structure where there's a bigger loop
- 04:36around which our data points are organized.
- 04:39And then there are smaller loops
- 04:41that surround those bigger loops.
- 04:43Being able to recognize these types of
- 04:46patterns can be extremely powerful,
- 04:48especially when we are dealing
- 04:50with biological data.
- 04:51And then recently I moved to a postdoc
- 04:56in genetics and also
- 04:59in computer science
- 05:00at Yale University.
- 05:02And I spent my postdoc thinking a lot
- 05:05about data that is structured like a graph.
- 05:08And So what I mean by a graph
- 05:11here is a set of nodes and edges.
- 05:14So the nodes being represented as the
- 05:16circles and edges being the lines that
- 05:19are connecting different nodes together.
- 05:21There's lots and lots of data out there
- 05:24that can be represented in this format.
- 05:27For example, if you are looking
- 05:29to do drug discovery,
- 05:31you can represent molecules using
- 05:34nodes and edges corresponding
- 05:37to atoms and bonds respectively.
- 05:40You can take protein sequences,
- 05:41fold them with AlphaFold and then
- 05:44represent protein structure in this manner.
- 05:46But also you can take neuroscience
- 05:50data such as brain imaging data and
- 05:54divide the brain into different parcels
- 05:56and learn to represent brain imaging
- 05:59data in this format where the nodes
- 06:02are going to represent different
- 06:03parcels or regions of the brain.
- 06:06And the edges could be anatomical
- 06:09connectivity between those
- 06:10parcels in the brain,
- 06:12or it could be due to functional connectivity
- 06:14between those parcels in the brain.
- 06:16Maybe at an even higher level,
- 06:18one can think of like taking biomedical
- 06:21data in general and representing it in
- 06:23in the format of a knowledge graph,
- 06:26where you can bring in data from,
- 06:28you know,
- 06:30publications from single cell sequencing
- 06:34experiments and other modalities
- 06:35and represent all of that data in
- 06:38a large biomedical knowledge graph.
- 06:40So during my postdoc,
- 06:42I've developed techniques to not
- 06:44only represent data as graphs,
- 06:46but also to develop machine learning
- 06:49techniques to learn to reason about
- 06:52these kinds of graphs and represent
- 06:54them in a way that a computer can
- 06:56understand the structure of the
- 06:58graph and take advantage of that
- 07:00to answer all kinds of questions.
- 07:02And so today I'm going to talk
- 07:05to you about a technique that is
- 07:09utilizing this graph structure and
- 07:11combining it with aspects of topology,
- 07:15which is essentially the technique that
- 07:17allows us to recognize the shape of our data.
- 07:20And so to motivate this,
- 07:23this technique that I'm going to
- 07:24talk about and we're going to
- 07:26develop over the next few workshops,
- 07:28I wanted to share with you some
- 07:30time lapse microscopy images that
- 07:33were taken a long time ago.
- 07:36And these are calcium imaging data
- 07:39sets of a developing zebrafish embryo.
- 07:43And if I could play these,
- 07:46not sure if I can hang on a second,
- 07:52OK, if I play these images,
- 07:55What you'll notice here is that on the left,
- 07:58you have a zebrafish embryo
- 08:00that's early in its development.
- 08:02In the middle, it's grown a little bit more.
- 08:05And on the right,
- 08:06the zebrafish embryo is much
- 08:07further along in its development.
- 08:09What you're going to notice is
- 08:11that the signalling patterns,
- 08:12the calcium signalling patterns across
- 08:15development look very different.
- 08:17In the video on the left,
- 08:19early in development we see that
- 08:22we see individual spiking events,
- 08:24so individual calcium signaling
- 08:26events that are not really
- 08:28correlated temporally or spatially.
- 08:31A little bit further along in development,
- 08:34we start to see patches of
- 08:36synchronous activity in the embryo.
- 08:39So you see these small patches,
- 08:40but they don't really travel very far.
- 08:43And even later in development you
- 08:45start to see these waves traveling
- 08:48wave like patterns where you have
- 08:50calcium signaling starting at a
- 08:52small group of cells and that
- 08:54really kind of expands and goes all
- 08:56across the embryo of the zebrafish.
- 08:58And even today,
- 09:00although we have really nice
- 09:02techniques for being able to capture
- 09:05this kind of imaging techniques,
- 09:07we don't really have good quantitative
- 09:11tools to be able to analyze the
- 09:15spatial temporal patterns that
- 09:17we see in these videos.
- 09:19Likewise for brain imaging,
- 09:21we have really well-developed
- 09:23tools: fMRI, NIRS, EEG,
- 09:26all kinds of tools to image the brain.
- 09:33And we see here for example in this in
- 09:37this example that the brain activity
- 09:39patterns that we get in a healthy
- 09:42typically developed human and an
- 09:44individual who's suffering from Alzheimer's,
- 09:47they're very different.
- 09:48We don't really have a good
- 09:52toolset to be able to analyze
- 09:54the spatial temporal dynamics that
- 09:57we are seeing across the brain.
- 09:59So this problem of like quantifying
- 10:03dynamics both spatially and temporally,
- 10:06this exists not just at a cellular
- 10:09and tissue scale in biology,
- 10:12but also at a systems and organ
- 10:15scale in neuroscience.
- 10:16And this is something that
- 10:18we wish to address.
- 10:20And So what are some of the challenges
- 10:23in these data sets and how do we go
- 10:26from these noisy high dimensional
- 10:29neuroimaging data sets to neural insights?
- 10:32And what I mean by neural insights
- 10:35here is really figuring out
- 10:37patterns of activity both spatially
- 10:40and temporally in the brain that
- 10:43correspond to various kinds of stimuli,
- 10:46various kinds of diseases,
- 10:49and various kinds of like tasks.
- 10:52And so ideally what we want to
- 10:54be able to do is,
- 10:55is to build a network that can take
- 10:59in patterns of brain activity and say
- 11:02that this pattern of brain activity
- 11:05corresponds to somebody who's maybe
- 11:08clicking their right thumb like this.
- 11:12And so the challenge
- 11:14is enormous because when we look
- 11:17at this kind of data,
- 11:19and this is again a brain imaging
- 11:21data set here,
- 11:22if you visualize the data,
- 11:23we see that there is a lot of noise
- 11:27in this data set.
- 11:29If we take just one voxel of
- 11:31this brain imaging
- 11:32data set and we visualize it over time,
- 11:35we see that we don't really see this nice
- 11:38clean line that we would like to see.
- 11:40In fact, we see that the
- 11:41data is all over the place.
- 11:43So we have to learn to be
- 11:45able to denoise this data set.
- 11:49The second thing we want to do is
- 11:51we want to learn
- 11:54salient features of the data set.
- 11:56So in these neuroimaging data sets and also
- 11:59in calcium imaging and other data sets,
- 12:02not all features of the
- 12:05image are equally important.
- 12:07There are some features of the image
- 12:09that are salient to the task at hand,
- 12:11whether it's to diagnose individuals
- 12:13or to learn what kind of stimulus they
- 12:17are experiencing or to learn
- 12:20how to decode their brain activity into
- 12:23whatever stimulus that they experienced.
- 12:26So distilling the state space of the
- 12:29brain and learning salient features
- 12:31of this data set is very important.
- 12:35And finally, in neuroimaging in particular,
- 12:39we are always challenged by spatial
- 12:42versus temporal resolution.
- 12:44So we have techniques such as
- 12:47EEG which have very,
- 12:49very good temporal resolution but
- 12:52have very poor spatial resolution.
- 12:55On the other hand,
- 12:57we have techniques such as fMRI where
- 12:59the spatial resolution is amazing.
- 13:01We get thousands and thousands
- 13:03of voxels across the brain,
- 13:05but the temporal resolution of fMRI at
- 13:09around 0.5 Hz is very low compared to EEG.
- 13:13So we want to develop techniques
- 13:16that can bridge the gap
- 13:18between high spatial resolution
- 13:20and high temporal resolution.
- 13:22And we want to develop techniques
- 13:25that can perhaps integrate multiple
- 13:27modalities of data together so we can
- 13:30benefit from both high spatial resolution
- 13:33and also high temporal resolution.
- 13:38So those were just some of the motivating
- 13:41factors that in our lab led to the
- 13:44development of a technique called
- 13:48GSTH. GSTH stands for Geometric
- 13:51Scattering Trajectory Homology.
- 13:53And it's a bit of a mouthful.
- 13:55And over the next two or three workshops,
- 13:59we are going to go into all the
- 14:02components that form this methodology.
- 14:05And so today, just to begin with,
- 14:07I'll just give you a very short
- 14:10introduction
- 14:11to how this methodology works.
- 14:14And so in this method,
- 14:16we start by creating a graph from our data.
- 14:20If we are dealing with
- 14:22some calcium imaging data,
- 14:24like imagine you are imaging
- 14:27calcium from the primary visual
- 14:29cortex of a mouse and you're
- 14:31maybe imaging in layer 4.
- 14:33Let's say you're going to
- 14:36get a sequence of images.
- 14:38And what you can do is you can use
- 14:41existing tools to segment those images
- 14:44so you know where the cells are located.
- 14:47And then you can build a graph.
- 14:49And by graph again,
- 14:51I mean nodes and edges by using the
- 14:54centroids of all the cells as nodes in
- 14:57the graph and putting an edge between
- 15:00any pair of cells that share a boundary.
- 15:03So any two cells that are adjacent
- 15:05to each other will be two nodes
- 15:08connected by an edge in the graph.
- 15:10Similarly,
- 15:11if you have some neuro imaging data set
- 15:14that you're looking to analyze with GSTH,
- 15:18what you can do is you can take the
- 15:21brain and you can convert it into
- 15:24parcels using your favorite Atlas.
- 15:26And so those individual parcels of the
- 15:29brain will form the nodes in the graph.
- 15:33And we are going to put an edge between
- 15:36any pair of parcels that are anatomically
- 15:39close to each other in the brain.
- 15:42So we start with a graph construction.
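As a concrete sketch of this graph-construction step (my own minimal illustration, not the speaker's code), one can take hypothetical segmented cell centroids and use a Delaunay triangulation as a stand-in for "adjacent cells become connected nodes":

```python
import numpy as np
from scipy.spatial import Delaunay

def centroid_graph(centroids):
    """Nodes are cell centroids; an edge joins two centroids that share
    a Delaunay triangle edge (a common proxy for cell adjacency)."""
    tri = Delaunay(centroids)
    edges = set()
    for a, b, c in tri.simplices:           # each simplex is a triangle
        for i, j in ((a, b), (b, c), (a, c)):
            edges.add((min(i, j), max(i, j)))
    return edges

# Four hypothetical centroids forming a convex quadrilateral:
# Delaunay gives two triangles sharing one diagonal, i.e. 5 edges.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.2, 1.1]])
edges = centroid_graph(pts)
print(len(edges))  # 5
```

For the brain-parcel case described next, the same function could be applied to parcel centroids, or the edge rule swapped for a distance threshold between anatomically close parcels.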
- 15:45Now each node in the graph will have
- 15:48a signal assigned to it and the signal
- 15:51is going to be a time varying signal.
- 15:54In the case of calcium imaging,
- 15:57for example,
- 15:57we are going to have as our signal
- 16:01the calcium activity over time.
- 16:04In the case of
- 16:06neuroimaging data sets,
- 16:07we are going to have averaged voxel
- 16:11activations within each parcel as
- 16:14our time lapse signal on the graph.
- 16:18And so in GSTH,
- 16:20what we do is we take that graph
- 16:22and we use some techniques in graph
- 16:25signal processing to convert the time
- 16:28lapse signal on the graph into some
- 16:31kind of numerical representation.
- 16:33So think of like taking this
- 16:35time lapse signal on the graph
- 16:37and coming up with a vector,
- 16:39which is nothing but a sequence
- 16:42of numbers that represent how that
- 16:45signal is distributed in the graph.
- 16:49And we're going to cover how this
- 16:52graph signal processing happens
- 16:53in the next workshop.
- 16:55But assuming you can do that,
- 16:57the next step in our methodology
- 17:00is to construct a trajectory of
- 17:05the dynamics using some nonlinear
- 17:08dimensionality reduction techniques.
- 17:10And again,
- 17:10this is something that we will cover
- 17:12in detail in subsequent workshops.
- 17:14But what's happening here is that
- 17:17we are representing the time
- 17:19lapse data that we started with
- 17:21through a low-dimensional trajectory.
- 17:24So in this case, I'm showing you a 3D
- 17:28trajectory and it's colored by time.
- 17:30And so we are saying that we start
- 17:32over here and we kind of move around,
- 17:35we go in a circle and we end up
- 17:38in this region of the space.
- 17:40And so recall how I talked to you
- 17:43earlier about denoising and learning
- 17:45the state space as being important
- 17:49challenges in neuroscience.
- 17:51Well, this graph signal processing
- 17:54in Step 2 effectively denoises
- 17:57the data set that we started with.
- 18:00And these trajectories,
- 18:02these low dimensional trajectories allow
- 18:05us to quantify where in state space we are.
- 18:10In particular,
- 18:11what I want to emphasize is that
- 18:14within these trajectories,
- 18:16anytime you get a loop in the trajectory,
- 18:20that means that your underlying
- 18:22signaling pattern has some kind
- 18:25of periodicity attached to it.
- 18:28Because a loop structure in this
- 18:31low-dimensional space indicates that we end
- 18:33up at the same state or close to the
- 18:37same state where we started from.
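To make the loop-implies-periodicity point concrete, here is a toy sketch of my own: plain PCA stands in for the nonlinear dimensionality reduction mentioned above, and a perfectly periodic signal on 20 hypothetical graph nodes traces a closed loop in the reduced space, ending next to where it started.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_nodes = 100, 20
phases = rng.uniform(0, 2 * np.pi, n_nodes)
t = np.linspace(0, 2 * np.pi, T, endpoint=False)   # one full period
X = np.sin(t[:, None] + phases[None, :])           # (T, n_nodes) signal

# Plain PCA via SVD: project each timepoint onto the top 2 components.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
traj = Xc @ Vt[:2].T                               # 2-D trajectory

# Periodicity shows up as closure: the trajectory's endpoint lands
# next to its start, relative to the overall size of the loop.
gap = np.linalg.norm(traj[-1] - traj[0])
extent = np.ptp(traj, axis=0).max()
print(gap < 0.1 * extent)
```

An aperiodic signal run through the same steps would leave `gap` comparable to `extent` instead of much smaller.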
- 18:39So these trajectories are really quite
- 18:42informative and we can interpret the
- 18:45shape of these trajectories by looking
- 18:48at our data through the
- 18:50lens of periodicity and quasi periodicity.
- 18:54So we recognize that the shape of
- 18:56these trajectories is very important.
- 18:58And in order to be able to compare
- 19:01across different data sets and across
- 19:03different subjects in an experiment,
- 19:05we need to find a way of quantifying
- 19:09the shape of the trajectory.
- 19:11And to quantify the shape of the trajectory,
- 19:14we use our topological data
- 19:17analysis as our main tool.
- 19:19And so topological data analysis is a
- 19:21technique that I'm going to cover today,
- 19:24which takes point cloud data.
- 19:27What I mean by point cloud data is just a
- 19:29bunch of points sitting in some dimension.
- 19:32In this case,
- 19:32these points are all in 3-dimensional
- 19:35space, and it converts them into
- 19:38something called a persistence diagram.
- 19:40And this persistence diagram quantifies
- 19:43how connected those points are.
- 19:46And it also quantifies the shape of
- 19:49this data in the sense that it measures
- 19:53how loopy the trajectory is and whether
- 19:56or not that trajectory has any holes in it.
- 19:59So again,
- 20:00this might sound very abstract at this stage,
- 20:02but this is a technique that I'm going
- 20:04to talk about in more detail today,
- 20:06topological data analysis.
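For the connectivity (0-dimensional) part of a persistence diagram, a crude sketch is possible with standard tools. This is my own illustration, relying on the standard fact that single-linkage merge heights for a Euclidean point cloud are exactly the scales at which connected components die:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def h0_diagram(points):
    """Crude 0-dimensional persistence diagram: every point is 'born'
    at scale 0, and a connected component 'dies' at the distance where
    it merges into another one; single-linkage merge heights give
    those death scales (one component survives forever and is omitted)."""
    merge_heights = linkage(points, method="single")[:, 2]
    return [(0.0, float(d)) for d in sorted(merge_heights)]

# Two tight hypothetical clumps: two short-lived bars (within-clump
# merges) and one long-lived bar (the clumps joining much later).
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(h0_diagram(pts))
```

The long-lived (highly persistent) bar is the topological signal, the short-lived ones are noise; loops and holes need the higher-dimensional homology covered later.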
- 20:07And what we do then is we can take
- 20:10these topological features that are
- 20:13capturing the shape of our trajectory,
- 20:15and we can put them through some
- 20:18machine learning in order to be able
- 20:20to use GSTH as a diagnostic tool,
- 20:23for example.
- 20:24So what machine learning will do is it
- 20:26will take these topological features.
- 20:29And it will classify whether or not
- 20:32the individual that we are looking at
- 20:35is a typically developed individual
- 20:37or whether they have schizophrenia
- 20:40or they have OCD or Alzheimer's.
- 20:43What you can also do is you can
- 20:45use this technique to figure out
- 20:47whether or not the brain,
- 20:49is it in the resting state or
- 20:51is it engaged in some task.
- 20:52You can learn to figure out what task
- 20:55an individual is doing by quantifying
- 20:58the shape of these trajectories.
- 21:00And there are many,
- 21:01many other application areas that I'm
- 21:04sure you can think of applying this to.
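The downstream classification step can be as simple as a standard classifier on topological feature vectors. A hedged sketch with made-up two-class features (stand-ins for persistence-diagram summaries, not real patient data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in features: imagine each row summarizes a subject's
# persistence diagram (e.g. number of loops, total persistence), with
# the two groups shifted apart so the task is learnable.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=(100, 4))
group_b = rng.normal(loc=3.0, scale=1.0, size=(100, 4))
X = np.vstack([group_a, group_b])
y = np.array([0] * 100 + [1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```

In practice the features would come from the GSTH pipeline and the labels from diagnoses or task conditions; the held-out accuracy is what makes the claim of a diagnostic tool testable.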
- 21:07So just want to start with the
- 21:10workshop organization briefly.
- 21:12So we have two other fantastic speakers
- 21:16for our workshop, Rahul Singh,
- 21:19he's in the audience today and Brian
- 21:22Zabowski is also in the audience today.
- 21:24Rahul is a Wu Tsai postdoctoral fellow.
- 21:27He will be talking to you next week
- 21:30and he'll be talking about graph
- 21:33signal processing methods that form
- 21:35the second step of our methodology.
- 21:37And then the following week,
- 21:39Brian and I
- 21:41will jointly present to you the
- 21:43entirety of the GSTH technique and we'll
- 21:47share with you several applications
- 21:49of GSTH both for cellular imaging data
- 21:53sets and also for neuro imaging data sets.
- 21:56And then of course, Helen will be around
- 21:59to facilitate all of these workshops.
- 22:01She's really the brains behind the operation.
- 22:04And so we have the first three workshops to cover
- 22:08different aspects of the GSTH methodology.
- 22:11We are starting from the end.
- 22:13So I'm going to talk about topological
- 22:15data analysis and machine learning today.
- 22:17Rahul will then talk about
- 22:19graph signal processing.
- 22:20In the third workshop,
- 22:22we'll bring these things together
- 22:24and go over the complete GSTH
- 22:27methodology and its applications.
- 22:29And then the final week of the workshop,
- 22:31we are going to do a hands on
- 22:34tutorial where you'll get to load
- 22:37some neuro imaging data set and
- 22:40also cellular imaging data set and
- 22:43analyze it using GSTH in Python.
- 22:45And at the moment I think
- 22:48we're planning to hold our fourth
- 22:50workshop as a hybrid workshop
- 22:52that might have an in-person component.
- 22:55So we'll get back to you on that and
- 22:58the location for that in subsequent weeks.
- 23:01Yes, I will send a series of
- 23:03emails where people can sign up
- 23:06for the in-person component. Great.
- 23:10All right. So we have a few live participants.
- 23:13I understand that the majority of these
- 23:16workshops get viewed online over a period
- 23:18of like weeks and months and years.
- 23:21So please feel free to stop me
- 23:23anytime and to ask questions.
- 23:25And so as I mentioned, I'm going to start
- 23:29with topological data analysis.
- 23:31And depending on how much time I
- 23:33have available to me,
- 23:35I will also cover some fundamentals
- 23:37of machine learning just to make sure
- 23:40that everybody is on the same page
- 23:42and we all share the same vocabulary
- 23:44in the weeks going forward.
- 23:46So let's start with TDA.
- 23:49And I wanted to start by just showing
- 23:51you some of these point cloud examples.
- 23:54And so when I look at these data sets,
- 23:56what I see is that maybe in the first
- 23:59data set, we have two variables.
- 24:01Maybe we have an independent variable
- 24:04and a dependent variable that are
- 24:06strongly correlated together.
- 24:08And this to me looks kind of like
- 24:10a linear correlation,
- 24:11like a regression type of data set.
- 24:14When I look at the second data set here,
- 24:17what I'm recognizing is that
- 24:19the data set is clustered.
- 24:21We have a bunch of points that are
- 24:24grouped together and we have kind of
- 24:27three clusters of data in this data set.
- 24:29The third data set,
- 24:31to me looks cyclical.
- 24:33I can spot a circle in this data set,
- 24:37and that might indicate that perhaps
- 24:39this is some time lapse data set.
- 24:41Maybe there's some kind of oscillatory
- 24:43nature to this data set,
- 24:45and maybe we're going around in
- 24:47circles and the last data set
- 24:49here has this kind of Y shape.
- 24:52It looks like it's kind of branching out.
- 24:55This could be maybe some stem cells
- 24:57down here that are, you know,
- 24:59differentiating into two different lineages.
- 25:02It seems to have this tree like
- 25:04hyperbolic structure to it.
- 25:06And so our brains are really,
- 25:08really good at recognizing the
- 25:11shape of the data,
- 25:13especially when the data is presented
- 25:15to us in these low dimensions.
- 25:17And we understand fundamentally
- 25:19that any data that we have,
- 25:22that data has some shape,
- 25:24and the shape carries some meaning.
- 25:27And this really is the central
- 25:30tenet of topological data analysis,
- 25:33which is a branch of applied
- 25:36mathematics and computer science
- 25:38that has to do with understanding
- 25:41fundamentally the shape of our data.
- 25:44And underlying all of this is what
- 25:47we call the manifold hypothesis.
- 25:50The idea being that any scientific
- 25:53data that we collect in our lab
- 25:57might look very noisy and it
- 25:59might be very high dimensional.
- 26:01But quite often that scientific data
- 26:04is sampled from some low dimensional
- 26:07manifold And what we are really after
- 26:10is to understand what that manifold
- 26:13looks like and what the intrinsic
- 26:16dimension of that manifold is.
- 26:19So in this example here our manifold
- 26:21looks to be kind of saddle shaped and
- 26:25it has these two curvature areas.
- 26:27So it has a direction of positive curvature,
- 26:30a direction of negative curvature,
- 26:32and our data is simply
- 26:34sampled from this manifold.
- 26:36So what we really want to understand
- 26:38is the shape of the manifold.
- 26:40Another way to look at this is
- 26:42what we get in our experiments
- 26:45are individual data points,
- 26:48and those data points all
- 26:50together form some kind of shape.
- 26:52And what we really want to see
- 26:54is what that shape looks like.
- 26:56So in this case,
- 26:57all these data points form a
- 26:59torus and this is kind of,
- 27:01this is the realization that we
- 27:03are going to come to is that
- 27:05our data is arranged in the
- 27:07shape of a doughnut or a torus.
- 27:09So how do we actually go about doing that?
- 27:12Let me share with you the methodology
- 27:16using some very simple data sets that
- 27:19are easy to plot on a 2-dimensional slide.
- 27:22And so we'll be working with these two
- 27:25data sets for the next few slides.
- 27:27The data set on the left,
- 27:29I'm going to call the concentric
- 27:31circles data set and and that's
- 27:34simply in recognition of the fact
- 27:36that these points are sampled
- 27:38from 2 circles where one circle
- 27:41is within another circle.
- 27:42And the data set on the right,
- 27:45I'm going to call the half moons data
- 27:48set simply because
- 27:51we have two arcs in our
- 27:53data and they both look like
- 27:55half moons or crescent moons.
- 27:58And So what we want to do is we
- 28:00want to use a technique to recognize
- 28:02the fact that our data on the left
- 28:05is arranged in two circles.
- 28:07And the data on the right,
- 28:08it looks kind of circular,
- 28:10but it's not really two circles
- 28:14or one circle for that matter.
- 28:16And so one thing you might
- 28:18consider is using a
- 28:20clustering method to see if that works,
- 28:22right?
- 28:22So you could take those data points
- 28:25and throw them into an algorithm,
- 28:27maybe something similar to K means.
- 28:29And you might see like,
- 28:30OK, does the data cluster?
- 28:32Well,
- 28:32if you run this data set through K means,
- 28:35you'll end up with these clusters,
- 28:37the blue cluster and the orange cluster.
- 28:40And these two clusters don't really
- 28:42tell you the true story behind the data.
- 28:45In particular,
- 28:46they don't recognize the fact that
- 28:48these data are arranged in two circles.
- 28:50And we
- 28:51even get some misclustering
- 28:53happening in the data set on the right.
- 28:56Now you might then go back and say that,
- 28:57OK,
- 28:58I should use a different sort
- 29:00of clustering technique.
- 29:02Maybe I can cluster the data by its density.
- 29:05And so when you employ a density-based
- 29:08clustering method such as DBSCAN,
- 29:10you do indeed get the correct
- 29:13cluster labels for your data.
- 29:15You are able to separate data
- 29:16points in the inner circle from
- 29:18data points in the outer circle,
- 29:20and you are able to separate the
- 29:22data points belonging to the upper
- 29:24crescent moon and the lower crescent moon.
- 29:27Even then, the machine doesn't really know
- 29:30that the data is arranged as circles.
- 29:33It has no recognition of that.
- 29:35It has simply learned that your data
- 29:37is clustered into these two groups,
- 29:39but it doesn't fundamentally understand
- 29:42what we can tell immediately: that this
- 29:45data is arranged in a circular pattern.
- 29:48And so this is where topology comes in.
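The clustering comparison just described is easy to reproduce; here is a small sketch of my own (the `eps` value is a guess suited to this sample size and noise level, not a universal setting):

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

# Concentric-circles data, as on the slides.
X, y = make_circles(n_samples=500, factor=0.5, noise=0.02, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
db = DBSCAN(eps=0.15, min_samples=5).fit(X)

# K-means splits the plane in half; density-based DBSCAN recovers
# the two circles (agreement measured against the true labels).
print(adjusted_rand_score(y, km.labels_))   # near 0
print(adjusted_rand_score(y, db.labels_))   # near 1
```

Even the successful DBSCAN labels only say "two groups"; nothing in either output encodes that each group is a circle, which is exactly the gap topology fills.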
- 29:52And so I'm going to talk
- 29:54to you about topology.
- 29:55And because I'm a very visual learner,
- 29:58I'm going to use some animations and
- 30:00some figures to kind of demonstrate how
- 30:03topology works without necessarily going
- 30:06into all the math and all the code behind it.
- 30:09We'll get to, we'll use some of
- 30:12the code in our third workshop.
- 30:14But honestly, like the code is
- 30:16something you import and you use.
- 30:18And so I think it's much more important to
- 30:21kind of build intuition around topology.
- 30:24So what we do in topology is we
- 30:27build something called simplicial complexes.
- 30:31And there's a number of different kinds of
- 30:34simplicial complexes that one can build.
- 30:36But I'm going to talk about the Vietoris-Rips
- 30:39simplicial complex to begin with today.
- 30:42And so to create a Vietoris-Rips
- 30:45simplicial complex from your data,
- 30:47what you do is you start with a given
- 30:51data point and you imagine a disk of some
- 30:55radius epsilon around that data point.
- 30:58And you do this for every other
- 31:00data point in the data set.
- 31:02And you're going to grow this
- 31:05epsilon radius disk over time.
- 31:08And what you're going to do is when
- 31:11two epsilon radius disks intersect
- 31:13with each other, so they overlap,
- 31:16you're going to draw an edge between
- 31:19those two data points, creating a 1-simplex.
- 31:23When you have three points, shown
- 31:26here as A, B, and C,
- 31:28and their epsilon disks all
- 31:31intersect in a pairwise manner,
- 31:34then we're going to draw a filled-in
- 31:36triangle which we are going to
- 31:39call a 2-simplex.
- 31:40And then in higher dimensions,
- 31:42when we have four data points all
- 31:44intersecting in a pairwise manner,
- 31:47then we're going to draw a 3-simplex.
- 31:50So we are going to take our data set.
- 31:52In this case the data set happens to
- 31:55be two-dimensional and we are
- 31:58constructing these simplices from our
- 32:00data by expanding these epsilon radius
- 32:03discs around each point in the data set.
- 32:06And so in this visualization here,
- 32:09I'm simply showing you the 0-simplices,
- 32:11which are the data
- 32:12points that we started from,
- 32:14and the 1-simplices, which are all
- 32:17the edges that get created as we are
- 32:20expanding this epsilon radius disk.
- 32:22I'm not showing you the disk and I'm
- 32:25not showing you the filled-in triangles
- 32:27or the tetrahedra simply because
- 32:30the figure gets very, very crowded.
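A bare-bones sketch of this construction, keeping only the 0- and 1-simplices. It follows the disk picture above, so two points are joined when their epsilon-disks overlap, i.e. at distance at most 2*epsilon (some references instead use distance at most epsilon; only the scale convention differs):

```python
import numpy as np
from itertools import combinations

def rips_edges(points, eps):
    """1-simplices of the Vietoris-Rips complex: join two points whose
    eps-radius disks overlap (pairwise distance <= 2*eps)."""
    return [(i, j) for i, j in combinations(range(len(points)), 2)
            if np.linalg.norm(points[i] - points[j]) <= 2 * eps]

def n_components(n, edges):
    """Connected components of the resulting graph, via union-find."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for i, j in edges:
        parent[find(i)] = find(j)
    return len({find(i) for i in range(n)})

# Two well-separated clumps: a small eps sees two components,
# a large eps connects everything into one.
pts = np.array([[0, 0], [0.1, 0], [0, 0.1], [3, 3], [3.1, 3], [3, 3.1]], float)
print(n_components(len(pts), rips_edges(pts, 0.1)))   # 2
print(n_components(len(pts), rips_edges(pts, 3.0)))   # 1
```

Tracking how this component count (and, with the higher simplices, the loop count) changes as eps grows is exactly the filtration idea developed in the following slides.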
- 32:34So why do we want to construct this
- 32:38Vietoris-Rips complex?
- 32:40Well,
- 32:41it turns out if you have some data shown
- 32:44here as these red dots that are sampled,
- 32:49this is your experimental data,
- 32:51and you imagine that this
- 32:53data is coming from some
- 32:55kind of underlying manifold.
- 32:57So there's a recognition here that
- 32:59whatever data we sample comes from
- 33:01a manifold that has maybe two holes
- 33:03in the middle of it. It turns out,
- 33:06and there's a theorem to prove this,
- 33:08although we're not going to go through
- 33:10the proof, that if your
- 33:13data is well sampled, so all the
- 33:16points in X are sampled
- 33:18throughout the manifold quite well,
- 33:20then when you construct the
- 33:23Vietoris-Rips complex from this
- 33:27data set for some radius epsilon,
- 33:30then this Vietoris-Rips complex is basically
- 33:33equivalent to the underlying manifold.
- 33:36So in kind of more intuitive terms,
- 33:40what this theorem is saying is
- 33:42that if you want to learn the
- 33:45shape of your manifold where the
- 33:47data is being sampled from,
- 33:49it is sufficient to construct a Vietoris-Rips
- 33:53complex at some radius epsilon.
- 33:56And you will be able to find the manifold
- 33:59underneath the data and you'll be able
- 34:01to recognize the fact that your data
- 34:03is forming this one connected object
- 34:06which has two holes punched into it.
- 34:10OK, so let's get back to our example,
- 34:13the concentric circles example
- 34:15and the half moons example.
- 34:17And so here I'm showing you those
- 34:20epsilon radius discs around the data.
- 34:23So we have epsilon equals
- 34:26.05 at the beginning,
- 34:28we increase our epsilon value.
- 34:31And as we increase the epsilon value,
- 34:33these discs that I'm plotting in gray,
- 34:35they get bigger and bigger until
- 34:38they cover the whole space.
- 34:40And So what you can recognize here
- 34:43is that when our disk is quite small,
- 34:46even at epsilon equals .05,
- 34:49all the little points that
- 34:51are in the inner circle,
- 34:53they all get connected together
- 34:55because all of those disks are
- 34:58overlapping with each other.
- 35:00Then when we increase our epsilon to .15,
- 35:03the points in the inner circle are still
- 35:06all connected together,
- 35:07but now the points in the outer circle
- 35:13are also connected together.
- 35:14So we observe two loops in our data.
- 35:18As epsilon increases even more,
- 35:21these loops get closed in and they
- 35:24merge with each other, until at the
- 35:27end, when epsilon is really big,
- 35:29all of the disks intersect with
- 35:32each other and everything collapses
- 35:35into just one connected component.
- 35:39In the two half moons data set,
- 35:42what we see is that there is indeed
- 35:44a value of epsilon where there is a
- 35:46small circle that forms as these
- 35:49points all get connected together
- 35:50in a pairwise manner.
- 35:52But that little circle quickly
- 35:55disappears when epsilon increases
- 35:57further and these two arcs get
- 36:00connected together into one connected piece.
- 36:02And so this technique
- 36:06is called persistent homology.
- 36:09And what it gives us is what we
- 36:12call a topological barcode.
- 36:15So there is code out there that will
- 36:18take these data points as an input,
- 36:21doesn't have to be two-dimensional
- 36:22or three-dimensional,
- 36:23could be high dimensional data
- 36:25and it will perform this kind of
- 36:28computation and give you back
- 36:29a visual that looks like this,
- 36:32which is the topological barcode.
- 36:34So let's kind of go through the
- 36:36topological barcode and learn
- 36:38how to interpret the barcode.
- 36:40The barcode consists of two parts.
- 36:42The top half I'm going to call H sub
- 36:45zero, for dimension 0 homology, and the
- 36:48lower part I'm going to call H
- 36:51sub one, for dimension 1 homology.
- 36:53And so in dimension 0 homology,
- 36:56we are measuring connectedness of
- 36:58our data and we generally call this
- 37:02number of connected components.
- 37:03And So what you can see here is
- 37:07that when epsilon is close to 0,
- 37:09where my cursor is,
- 37:11we see lots and lots of bars in our data set.
- 37:14And these bars correspond to how
- 37:17many connected components there
- 37:19are in our data set.
- 37:21So when epsilon is 0,
- 37:22all the points are sitting by themselves,
- 37:25none of the points are
- 37:27connected to each other.
- 37:28So we get as many bars as the
- 37:31number of points in our data.
- 37:33As epsilon starts increasing,
- 37:35we start merging together points
- 37:38by connecting them with an
- 37:41edge and forming a 1-simplex.
- 37:43So as epsilon is increasing, here
- 37:45you can see that the number of
- 37:48bars gets fewer and fewer, until at
- 37:51high values of epsilon we end up
- 37:53with just one bar in our barcode.
- 37:57So this dimension 0 homology
- 38:00is capturing the connectivity of
- 38:02our data, and by looking at the slope
- 38:06by which these bars are decreasing in number,
- 38:09we can figure out how connected
- 38:11our data set really is.
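The H0 part of this procedure can actually be sketched directly: every point is born at epsilon = 0, and a connected component dies when it first merges into another one, which makes the death values exactly the edge lengths of a minimum spanning tree. This is an illustrative sketch of mine using a small union-find (here "death" is the centre-to-centre distance; divide by 2 if you prefer the disk-radius convention from the animation):

```python
from itertools import combinations
from math import dist, inf

def h0_barcode(points):
    """Dimension-0 barcode of a point cloud as (birth, death) pairs."""
    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    deaths = []
    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    for d, i, j in edges:      # Kruskal: process edges shortest-first
        ri, rj = find(i), find(j)
        if ri != rj:           # two components merge -> one bar dies
            parent[ri] = rj
            deaths.append(d)
    deaths.append(inf)         # the last component never dies
    return [(0.0, d) for d in deaths]
```

So three collinear points at 0, 1, and 10 give three bars: two that die at 1 and 9, and one that persists forever.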
- 38:14In dimension 1,
- 38:15which is at the bottom of this barcode,
- 38:19what we are measuring is the
- 38:21presence of loops in our data set.
- 38:23So at epsilon equal to 0 on the very left,
- 38:26we have no bars in the lower
- 38:28part of this diagram,
- 38:30which means there are no loops
- 38:32present at that value of epsilon.
- 38:35At a later value of epsilon, we
- 38:38see the occurrence of this first
- 38:40loop from these orange points
- 38:42in the inner concentric circle.
- 38:44That loop persists for a long period of time.
- 38:49What I mean by time is it persists
- 38:51for a large range of epsilon values.
- 38:55During this process,
- 38:56there is a second loop that forms,
- 38:59indicated by the second red bar.
- 39:01Here it emerges at a higher
- 39:04value of epsilon,
- 39:06and this outer loop dies sooner
- 39:08than the inner loop does.
- 39:10The inner loop persists for even longer.
- 39:13So by looking at the bars in our barcode,
- 39:17we can learn how many points our
- 39:20data has, simply by counting the
- 39:23number of bars at epsilon equal to 0.
- 39:26We can learn how connected our data
- 39:29set is by looking at how these bars
- 39:32disappear as epsilon increases.
- 39:34And then in the lower part of the bar code,
- 39:37by looking at these bars,
- 39:40we can learn how many loops
- 39:42are present in our data.
- 39:43In particular,
- 39:44the bars that are longer in length
- 39:47represent actual loops
- 39:49that are present in our data.
- 39:51There are indeed some smaller
- 39:54bars, which are small noisy
- 39:56loops that form as we perform this procedure.
- 40:00And what's apparent from these
- 40:02two barcodes is that in our first
- 40:05example with the concentric circles,
- 40:07there are two clear loops in that data.
- 40:11And in our second example, there is
- 40:13indeed a small loop that emerges here,
- 40:16but it quickly disappears,
- 40:18so there's no really topologically
- 40:21significant loops present
- 40:22in the second data set.
- 40:24And so these bar codes capture
- 40:26the shape of our data.
- 40:29You can continue to plot H2 and H3,
- 40:33which are going to capture higher
- 40:35dimensional holes in your data.
- 40:37So H2 is going to capture three-dimensional
- 40:41holes, or voids, in the data.
- 40:43H3 will capture even higher
- 40:45dimensional holes in the data.
- 40:46So topology captures the shape of our
- 40:49data by measuring connectedness and
- 40:50the presence of loops in the data.
- 40:53Are there any questions?
- 40:55Yes, Jay, just to kind of translate
- 40:57math into more intuition.
- 40:59When you say you have holes or loops,
- 41:02you're pretty much talking
- 41:04about some impossible states,
- 41:06meaning that your state cannot have this,
- 41:09like cannot be in the specific
- 41:11state for whatever reasons, right?
- 41:13Yeah, that's a great question.
- 41:15So I'm talking indeed about
- 41:17impossible states because these
- 41:19points are derived from experiments
- 41:21and they represent the state of our
- 41:24brain or the state of our tissue.
- 41:26And therefore if we have a hole in
- 41:29our data set, that means there's no
- 41:31data points present in the middle.
- 41:33And that means that state
- 41:35is impossible as far as we can
- 41:38tell from our experimental data.
- 41:39So that's the first conclusion.
- 41:42The 2nd conclusion which we can get is this.
- 41:46H1 measures kind of holes in two dimensions.
- 41:50And so that necessarily means that there
- 41:54is data that surrounds the hole, right?
- 41:56There must be some surrounding data and
- 41:58whenever there is data that surrounds a hole,
- 42:01that might indicate some kind
- 42:03of periodicity in the data set.
- 42:06So you can imagine that if you have data
- 42:08points that are arranged in a circle,
- 42:10doesn't have to be a perfect circle, it
- 42:12could be like an elliptical or skewed circle.
- 42:15This technique still works.
- 42:16But that tells you that
- 42:19there is some sort of process.
- 42:21Yeah, there's a process that goes
- 42:23around in a kind of periodic way.
- 42:26So you can navigate those that state
- 42:28space in a way that's periodic or
- 42:31almost periodic or quasi periodic.
- 42:34So impossible states as well
- 42:35as periodic states are being
- 42:38captured through dimension 1 homology
- 42:40in this technique, indeed.
- 42:44I also have one question.
- 42:45So when we are increasing epsilon,
- 42:48are there some loops
- 42:52that are disappearing?
- 42:54Because if we increase epsilon,
- 42:56loops should not
- 42:58disappear, right? Loops
- 43:00can disappear. So the way
- 43:02this outer loop is disappearing
- 43:04here is when there is a value of
- 43:07epsilon when one of the disks from
- 43:10the outer loop intersects with
- 43:12the disk from the inner loop.
- 43:14As soon as these two discs
- 43:16start intersecting,
- 43:17we draw an edge that goes from
- 43:19a point in the inner loop to a
- 43:21point on the outer loop, and that
- 43:24effectively connects those two
- 43:26loops together, and the enclosed
- 43:28space between the inner loop and the
- 43:31outer loop disappears at that stage.
- 43:33To get rid of the inner loop you have
- 43:36to increase epsilon a lot higher
- 43:38because you have two points that are
- 43:41opposite to each other in the inner loop,
- 43:44and when those two points
- 43:46get connected to each other,
- 43:47the inner loop closes in and disappears.
- 43:50So yes, loops can disappear,
- 43:53and in fact at a high enough value
- 43:56of epsilon, all loops will close up,
- 43:59and when epsilon is Infinity,
- 44:01all the points are necessarily
- 44:03intersecting with each other and there
- 44:05are no loops present in the data.
- 44:07Maybe this is a little bit more apparent
- 44:09in this previous animation, when I was
- 44:12drawing the 1-simplices, where you
- 44:14can see that at a higher value of epsilon,
- 44:16we do indeed end up closing the outer
- 44:19loop by connecting it to the inner loop.
- 44:22So you see that there's a bridge that forms
- 44:24from the outer loop to the inner loop,
- 44:25and that empty space here closes in
- 44:28and at a very high value of epsilon,
- 44:30there will be edges that
- 44:32go across the inner loop,
- 44:34closing the inner loop entirely.
- 44:36Although that doesn't happen in
- 44:38this animation because I didn't
- 44:40increase epsilon high enough.
- 44:45There is a question in the chat.
- 44:48And Zachary, would you like
- 44:49to ask your question live or
- 44:51would you like me to read that?
- 44:53I think I can read it.
- 44:54So the question says earlier you
- 44:57mentioned sufficient sampling of points
- 44:59as necessary to interpret a manifold.
- 45:01What you're showing now seems to
- 45:03address properties of the manifold.
- 45:04Is there a property to address how much
- 45:07variability is or isn't accounted for by
- 45:10this manifold characterization process,
- 45:12possibly due to under sampling?
- 45:14That's a great question.
- 45:16So what I'm presenting at the
- 45:19moment is under the assumption
- 45:22that our manifold is well sampled.
- 45:24So indeed, if your experimental
- 45:28procedure failed to sample a
- 45:30data point in the middle here,
- 45:32my conclusion would be that your
- 45:34data set has cellular states
- 45:36organized into two circles,
- 45:39and they're kind of
- 45:40independent of each other.
- 45:41There is indeed an outer circle,
- 45:43and that might be completely wrong
- 45:45simply because we never sampled data
- 45:48that exists between these two circles.
- 45:50So I am indeed operating under the
- 45:53assumption that the manifold is well sampled.
- 45:57Another aspect of the question was
- 46:00what properties of
- 46:03the manifold am I really capturing?
- 46:06And so I'm capturing topological
- 46:08properties of the manifold.
- 46:10And by topological properties I
- 46:12mean how connected that manifold
- 46:14is and whether or not there are
- 46:16holes in that manifold.
- 46:18And so this technique is like invariant
- 46:20to things like translation of the data.
- 46:23So if I take these circles and
- 46:25I translate them somewhere else,
- 46:27that doesn't impact it.
- 46:29If I take this diagram and I
- 46:31rotate it by 45° or 30°,
- 46:33that's not going to change
- 46:35the bar code at all.
- 46:36So it's translation invariant,
- 46:38it's rotationally invariant.
- 46:40And it's also invariant to certain
- 46:43kinds of deformations where if I take
- 46:46this arc and I deform it a little bit,
- 46:49that's not going to change my barcode
- 46:51and it's not going to change the
- 46:54fact that the data is connected
- 46:56in this crescent moon shape.
- 46:58We'll get into more of the details
- 47:00of what aspects of the manifold we
- 47:03are capturing as we progress further.
- 47:04But I hope that kind of goes a little
- 47:07way towards answering your question.
- 47:11OK, so we have introduced this
- 47:14idea of a topological barcode.
- 47:16Next I want to show you a more convenient
- 47:20way of representing this barcode
- 47:22which is called a persistence diagram.
- 47:26And a persistence diagram is
- 47:28is very easy to construct.
- 49:29What you do is you draw two axes;
- 49:33the X axis is called the
- 49:35birth axis and the Y axis is
- 49:37generally called the death axis.
- 49:39And these axes are going to represent
- 49:43when the bars start in
- 49:48the barcode and where they end.
- 49:50And so you cannot end a bar
- 49:53before starting it.
- 47:57And because of that,
- 47:58all the points in this persistence diagram
- 48:01are going to happen above the diagonal.
- 48:04And there's two kinds of points here.
- 48:06I hope you can see that there's
- 48:08points that are represented as tiny
- 48:10circles and there are points that
- 48:11are represented as tiny diamonds,
- 48:13and so the circles are coming
- 48:16from H 0, dimension 0 homology,
- 48:19representing the connectedness of the data.
- 48:21They are all present here on the
- 48:24left side of the persistence diagram
- 48:26because all of these bars in H zero
- 48:29start at epsilon zero and then they end
- 48:32with some positive value of epsilon.
- 48:34Therefore,
- 48:35all of these points here represent
- 48:38H 0 and the points that are over
- 48:41here further away from zero,
- 48:44these are all representing H1.
- 48:47So I'm simply taking the starting
- 48:49coordinate of the bar and the
- 48:51ending coordinate of the bar,
- 48:52and I'm just representing it
- 48:54along these two axes.
- 48:56And this is what's called
- 48:58a persistence diagram.
- 49:00The conventional wisdom in persistent
- 49:03homology and topological data
- 49:05analysis is that points that are
- 49:09further away from the diagonal,
- 49:12they correspond to longer bars
- 49:14in the bar code,
- 49:16and those are the topologically more
- 49:19significant features in our data.
- 49:21So in this case,
- 49:23for the concentric circles example,
- 49:24we see two red diamonds that
- 49:26are far away from the diagonal,
- 49:29indicating the presence of two loops
- 49:32or two circles in our data set.
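To make the "distance from the diagonal" idea concrete, here is a tiny sketch that ranks diagram points by persistence, which is death minus birth and proportional to the distance from the diagonal. The H1 values below are made-up illustrative numbers of mine, not computed from the actual example:

```python
def rank_by_persistence(bars):
    """Sort (birth, death) diagram points by persistence = death - birth,
    most significant (furthest from the diagonal) first."""
    return sorted(bars, key=lambda bd: bd[1] - bd[0], reverse=True)

# Hypothetical H1 diagram: two long-lived loops and one noisy one.
h1 = [(0.20, 1.50), (0.30, 1.10), (0.40, 0.45)]
ranked = rank_by_persistence(h1)
```

The first entries of `ranked` would be read off as the topologically significant loops, and the short-lived point near the diagonal as noise.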
- 49:36This is the topological barcode for the
- 49:39half moons data set,
- 49:41and this is the corresponding
- 49:43persistence diagram.
- 49:44You'll notice that this highlighted
- 49:46red diamond corresponding to that
- 49:49tiny loop that emerged for a bit
- 49:51is actually quite close to the
- 49:53diagonal in this persistence diagram,
- 49:55indicating that it's not very significant.
- 49:58Now,
- 49:59there are ways to compute statistics
- 50:01here and to figure out more
- 50:04quantitatively whether or not a given
- 50:07topological feature is significant,
- 50:09but I'm not going to get into that today.
- 50:11There are bootstrapping methods
- 50:13that give you a confidence
- 50:15interval around the diagonal,
- 50:18and anything that falls inside of
- 50:20that confidence interval is going
- 50:23to be an insignificant feature.
- 50:25And anything that falls outside
- 50:27of that confidence interval is
- 50:29further away from the diagonal,
- 50:30and is going to be a topologically
- 50:33significant feature.
- 50:34So those kinds of tools exist,
- 50:36but I'm not getting into that today,
- 50:38just to build intuition.
- 50:42OK, so here's a small quiz that I like
- 50:44to do when I maybe present this also to
- 50:48undergraduates where I have three data
- 50:51sets and three persistence diagrams,
- 50:52but I've kind of jumbled them all up.
- 50:55And so let me just quickly tell you what the
- 50:57data sets are and what the diagrams are.
- 50:59I'll give you a moment to think,
- 51:01I'll drink some water,
- 51:02and then we'll go over the solution.
- 51:04So the first data set here,
- 51:07it's very hard to tell,
- 51:08but these are two spheres, where I
- 51:11have data sampled from an inner sphere and
- 51:15data sampled from the outer sphere here.
- 51:19So it's in three dimensions.
- 51:21The second example are three circles where
- 51:23I have two circles that are concentric
- 51:26with one another and a separate circle
- 51:29that's outside of these two circles.
- 51:32And the third one is really hard
- 51:33to tell in this visualization,
- 51:35but this is data that's sampled
- 51:37from the surface of a doughnut,
- 51:40or a torus in mathematics.
- 51:42So this is again a three-dimensional object.
- 51:45In the lower half,
- 51:46I'm showing you persistence diagrams where
- 51:49the black dots are representing dimension
- 51:520 homology and that's connected components.
- 51:56The red triangles are representing dimension
- 52:001 homology which are loops in our data.
- 52:04And now we have blue diamonds,
- 52:07and the blue diamonds are
- 52:09representing H2 dimension 2 homology,
- 52:12which are three-dimensional
- 52:14holes or voids in our data.
- 52:17And so I'd like you to think about
- 52:20matching these data sets to their
- 52:24corresponding persistence diagrams.
- 52:25A hint would be to look at
- 52:28the blue diamonds first,
- 52:29because blue diamonds are
- 52:31indicating 3D empty space.
- 52:33I'm going to take a quick drink of water
- 52:35and then we'll go over the solution.
- 52:49OK, So hopefully folks have realized
- 52:54that this persistence diagram on the left
- 52:59doesn't have any blue diamonds in it,
- 53:02doesn't have any 3D empty space in it,
- 53:05and therefore it corresponds to the second
- 53:08data set of the three concentric circles.
- 53:12You can see that there is 2 red triangles
- 53:14that are further away from the diagonal here,
- 53:17and there's one red triangle that's a
- 53:19little bit away from the diagonal here.
- 53:22And those three triangles
- 53:25correspond to these three loops.
- 53:28One of the triangles is quite close
- 53:30to the diagonal because of the
- 53:31fact that these are concentric,
- 53:33so you can kind of bridge across
- 53:35them quite easily.
- 53:36Now we have 2 persistence diagrams
- 53:39that have diamonds in them,
- 53:41and to figure out which one is which,
- 53:44I think you have to look at some
- 53:47of these triangles again.
- 53:49And So what distinguishes the right one
- 53:51from the left one is on the right one.
- 53:53I'm not really seeing any triangles
- 53:55that are very far from the diagonal.
- 53:58But in this one,
- 53:59I see one triangle here that's
- 54:01far from the diagonal.
- 54:03And maybe there's another triangle
- 54:04here that's kind of separated from all
- 54:07the noise over here that's slightly
- 54:09further away from the diagonal.
- 54:11And so the way you can get these two
- 54:14triangles is because when you have a torus,
- 54:17if you think about a torus as a doughnut,
- 54:20you have a circle that goes
- 54:24across the torus,
- 54:25like a horizontal circle
- 54:27going around the doughnut.
- 54:29And then you have another loop,
- 54:31another circle that goes kind of
- 54:34perpendicular to the first circle, that
- 54:36goes around the doughnut the other way.
- 54:39So then in a doughnut,
- 54:40there are two loops and there's
- 54:43one empty space, or one 3D hole.
- 54:46Whereas in the concentric spheres example,
- 54:49you just have the empty space between the
- 54:53two spheres that's shown by this diamond.
- 54:57So sorry,
- 54:58there are two empty spaces.
- 54:59There's empty space inside the inner sphere,
- 55:02and there's empty space between the
- 55:05outer sphere and the inner sphere.
- 55:07And that shows up here because you
- 55:10have one blue diamond up here,
- 55:12and then maybe you have one blue
- 55:14diamond here that's slightly
- 55:15further away from the diagonal.
- 55:17And so those correspond to the space
- 55:19inside the inner sphere and the
- 55:22interstitial space between the two spheres.
- 55:25So hopefully that helps you build
- 55:27some intuition.
- 55:27This is a lot easier to do when
- 55:29I put confidence intervals,
- 55:31but if I do put confidence intervals,
- 55:33we'll have to compute them separately for
- 55:36dimension 0, dimension 1, and dimension 2.
- 55:39And so that makes things harder.
- 55:40Also,
- 55:41I don't really have enough space
- 55:42to draw three persistence diagrams
- 55:45with three confidence intervals,
- 55:47but hopefully that makes sense.
- 55:48If you have any question about how to
- 55:51interpret these persistence diagrams
- 55:52or like the solution to this little quiz,
- 55:55please feel free to chime in.
- 55:59OK, so now at this point,
- 56:03we we have figured out how to take
- 56:05our point cloud data set and convert
- 56:08it into a topological barcode,
- 56:11which we can represent
- 56:13as a persistence diagram.
- 56:14So the next thing we want to do is we
- 56:17want to compare two different data sets.
- 56:20And so comparing two different
- 56:22point clouds can be quite tricky.
- 56:24You don't really know like how to
- 56:27distinguish a torus from a sphere
- 56:30from some other blobby thing.
- 56:32And so one way in which you can
- 56:35compare these kinds of data, two
- 56:37different point clouds, is by instead
- 56:40comparing their persistence diagrams.
- 56:43And so there are multiple
- 56:46techniques to compute distances
- 56:48between persistence diagrams.
- 56:51And one of those techniques is what's
- 56:53called the bottleneck distance.
- 56:55And so what happens in the bottleneck
- 56:58distance is that you pair points up.
- 57:01And again,
- 57:02I should apologize here because I'm
- 57:04using color slightly differently here.
- 57:06So the blue colored dots are
- 57:09from the first persistence diagram,
- 57:11diagram 1, and the red
- 57:13squares are from diagram 2.
- 57:16And so we want to compare
- 57:17diagram 1 to diagram 2.
- 57:19And the way we do that is by first
- 57:22matching features in diagram
- 57:231 to features in diagram 2,
- 57:26where we also allow ourselves to
- 57:29map certain features to the diagonal itself.
- 57:32So that's a matching process that happens.
- 57:35And once you have matched the features
- 57:38to each other or to the diagonal,
- 57:40you find the two paired features that
- 57:43are furthest away from each other and
- 57:46you compute this distance between them.
- 57:48This is called the bottleneck distance,
- 57:51and there's ways to represent
- 57:53that mathematically.
- 57:54Here that's not so important.
- 57:56The intuition is probably
- 57:58what's most important,
- 57:59and there is a very important theorem
- 58:02in the field that guarantees stability;
- 58:05it's called the stability theorem.
- 58:08And what it says is that if I have a
- 58:11point cloud X and a point cloud Y,
- 58:14if my point cloud X is just a slight
- 58:16perturbation of point cloud Y.
- 58:18So I've just moved the points
- 58:20around a little bit.
- 58:21Then the bottleneck distance between
- 58:24the persistence diagrams computed
- 58:26through the Vietoris-Rips complex of
- 58:28X and the Vietoris-Rips complex of Y
- 58:31is guaranteed to be small, because if
- 58:36X is slightly different from Y,
- 58:39the right hand side of this
- 58:41equation is going to be close to 0,
- 58:42and therefore the bottleneck distance
- 58:44is going to be very very small.
- 58:47So this basically guarantees the fact
- 58:49that if you have one point cloud,
- 58:51maybe points arranged as a circle
- 58:53and you tweak the point slightly,
- 58:55so you add a little bit of noise
- 58:56to those points,
- 58:57then the bottleneck distance is
- 58:59not going to change much.
- 59:01So it means that this bottleneck distance
- 59:04is a stable way of comparing point clouds.
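The matching just described can be sketched by brute force for very small diagrams. This is purely illustrative, an exponential-time enumeration with my own function names, using the L-infinity ground distance that the usual definition of the bottleneck distance is stated in; real software computes this far more cleverly:

```python
from itertools import permutations

def bottleneck(diag1, diag2):
    """Brute-force bottleneck distance between two tiny persistence
    diagrams given as lists of (birth, death) pairs.  Each point may
    match a point of the other diagram (cost = L-infinity distance) or
    its projection onto the diagonal (cost = (death - birth) / 2)."""
    def cost(p, q):
        if p is None and q is None:
            return 0.0                      # diagonal matched to diagonal
        if p is None:
            return (q[1] - q[0]) / 2        # q sent to the diagonal
        if q is None:
            return (p[1] - p[0]) / 2        # p sent to the diagonal
        return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

    # Pad each diagram with None placeholders standing for the diagonal.
    a = list(diag1) + [None] * len(diag2)
    b = list(diag2) + [None] * len(diag1)
    # Minimise, over all matchings, the cost of the worst matched pair.
    return min(max(cost(p, q) for p, q in zip(a, perm))
               for perm in permutations(b))
```

Perturbing one diagram slightly changes the result only slightly, which is the stability property in miniature.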
- 59:07Again, I don't care so much about the math.
- 59:10The main result is that topology
- 59:15is robust to these kinds of
- 59:18noise and perturbations in our data.
- 59:20Now one of the problems
- 59:22with the bottleneck distance is that
- 59:25we are doing all this matching,
- 59:26but ultimately we are only really
- 59:28looking at the distance between the
- 59:30matched points that are furthest away
- 59:33from each other, and we're ignoring
- 59:35all the other points that got matched.
- 59:37So maybe you want to be more sensitive
- 59:40to how well the matching works,
- 59:43and the way to actually use that
- 59:46information is to compute what's
- 59:48called the Wasserstein distance,
- 59:50where you perform the matching process first,
- 59:54and then you compute this Wasserstein
- 59:57distance between diagram 1 and diagram 2,
- 59:59where you sum over the distance between
- 01:00:02all the points that are matched to
- 01:00:05each other and you ignore the points
- 01:00:07that get matched to the diagonal.
- 01:00:08So this is an even more stable
- 01:00:12way of comparing two persistence diagrams.
- 01:00:16And the Wasserstein distance also has
- 01:00:18stability properties similar to those
- 01:00:21I was talking about previously.
- 01:00:24If you have any kind of experience
- 01:00:27with optimal transport theory
- 01:00:29or statistics,
- 01:00:30then I just wanted to highlight that
- 01:00:33the Wasserstein distance that we're
- 01:00:35talking about here is actually
- 01:00:37exactly the same as
- 01:00:39the Wasserstein distance that you
- 01:00:40would be familiar with, in the sense
- 01:00:42that you have a transport map and
- 01:00:44you're kind of moving mounds of
- 01:00:46earth from one place to another.
- 01:00:48Or the Wasserstein distance that you
- 01:00:50use to compare two probability distributions.
- 01:00:52You can think about these topological
- 01:00:54features as like probability distributions,
- 01:00:56and you're learning to map one to the other.
- 01:00:58So this is just an aside for folks who might
- 01:01:02be more familiar with optimal transport.
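For tiny diagrams, the same brute-force idea gives a Wasserstein-style distance. Note one detail: in the standard definition a point matched to the diagonal still contributes its (death − birth)/2 cost, which this sketch charges; the description above of ignoring diagonal matches is a simplification. Again the function name and enumeration strategy are mine, for illustration only:

```python
from itertools import permutations

def wasserstein(diag1, diag2):
    """Brute-force 1-Wasserstein distance between two tiny persistence
    diagrams: minimise the *sum* of matching costs instead of the
    maximum, with diagonal matches costing (death - birth) / 2."""
    def cost(p, q):
        if p is None and q is None:
            return 0.0
        if p is None:
            return (q[1] - q[0]) / 2
        if q is None:
            return (p[1] - p[0]) / 2
        return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

    a = list(diag1) + [None] * len(diag2)
    b = list(diag2) + [None] * len(diag1)
    # Minimise the total transport cost over all matchings.
    return min(sum(cost(p, q) for p, q in zip(a, perm))
               for perm in permutations(b))
```

Because the sum is sensitive to every matched pair, a small noisy feature changes this distance only a little, while the bottleneck distance ignores it entirely unless it is the worst pair.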
- 01:01:05OK,
- 01:01:07I wanted to put this review
- 01:01:10article in because I have talked
- 01:01:13so far about extracting
- 01:01:16topological features from a point cloud
- 01:01:18data set, learning how to
- 01:01:21interpret those topological features,
- 01:01:23and comparing two different data
- 01:01:26sets by computing the bottleneck
- 01:01:28distance or the Wasserstein distance
- 01:01:31between their topological features.
- 01:01:33But it is also possible
- 01:01:37to use topology in a different way,
- 01:01:39where you use topology to inform the
- 01:01:43training of a machine learning architecture.
- 01:01:47And so for folks who are familiar
- 01:01:49with machine learning,
- 01:01:50I just wanted to point out that
- 01:01:53there are ways in which you can use
- 01:01:55topology inside the loss function
- 01:01:57of your neural network.
- 01:01:59So you can have a topology-informed loss,
- 01:02:02and a good example of this is a
- 01:02:05paper by Channel in 2019 that I can point to.
- 01:02:08You can also in machine learning
- 01:02:11use topology to compare two
- 01:02:13different model architectures.
- 01:02:15One way of doing this would be to
- 01:02:17have two different like machine
- 01:02:19learning architectures where you
- 01:02:20look at the activations in all
- 01:02:23the layers of those architectures,
- 01:02:25treat that as a point cloud, and compare
- 01:02:27them against each other using the Wasserstein
- 01:02:30distance of their persistence features.
- 01:02:32A good example of that is Zo
- 01:02:34et al. in 2021.
- 01:02:36And then what I'm talking about,
- 01:02:39which is similar to this paper from 2017,
- 01:02:41is to actually just take your data and
- 01:02:44use topology to featurize that data,
- 01:02:47which is to extract topological
- 01:02:49features of that data,
- 01:02:50learn to interpret those
- 01:02:52topological features,
- 01:02:53and then perhaps pass them into
- 01:02:55some machine learning framework
- 01:02:57to generate some kind of output.
- 01:02:59So there are different places in machine
- 01:03:01learning where one can use topology.
- 01:03:04In our case,
- 01:03:05we are going to focus on ways in
- 01:03:07which we use topological features
- 01:03:09extracted from our data and pass
- 01:03:12them into machine learning in order
- 01:03:14to do some kind of downstream task.
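As a toy example of the last option, one could summarize a persistence diagram into a fixed-length vector before handing it to a standard model. The particular statistics and the significance threshold below are arbitrary choices of mine, not a method from the talk:

```python
def featurize(diagram, threshold=0.1):
    """Hypothetical fixed-length feature vector from one homology
    dimension of a persistence diagram: feature count, count of
    'significant' features (persistence above a chosen threshold),
    total persistence, and the longest persistence.  The vector length
    no longer depends on how many topological features each data set
    happens to have, so it can feed any standard classifier."""
    pers = [d - b for b, d in diagram]
    return [
        len(pers),
        sum(p > threshold for p in pers),
        sum(pers),
        max(pers, default=0.0),
    ]
```

A diagram with one long bar and one noisy bar, for instance, maps to a 4-vector whose second entry counts only the long bar.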
- 01:03:17OK, so next I wanted to cover a few ways
- 01:03:21of taking these topological features
- 01:03:25and converting them into summaries.
- 01:03:28And the reason for doing that
- 01:03:30is because we want to use these
- 01:03:33topological features as input for
- 01:03:35machine learning down the line.
- 01:03:37And these diagrams that I've
- 01:03:39been drawing for you so far,
- 01:03:41they're easy to draw on a screen,
- 01:03:43but they're not really great for
- 01:03:45machine learning because if you
- 01:03:47have a bunch of different data sets,
- 01:03:49you're going to get a different number of
- 01:03:52topological features for each data set,
- 01:03:54and you don't really know a way of
- 01:03:56converting that into something
- 01:03:58that can go into machine learning.
- 01:03:59So folks have found various ways of
- 01:04:03taking these persistence diagrams
- 01:04:04and converting them into even more
- 01:04:08convenient representations that can be
- 01:04:10used either for mathematical analysis
- 01:04:13or, more importantly, for machine
- 01:04:14learning down the road.
- 01:04:16And one such representation is
- 01:04:18called the persistence landscape,
- 01:04:20where you are taking a diagram like this
- 01:04:23and converting that into a function.
- 01:04:25And the way you do that is
- 01:04:26quite simple, really.
- 01:04:28You take each point and you draw a
- 01:04:31tent function based off of that point
- 01:04:33by connecting it to its X coordinate
- 01:04:36and connecting it to its Y coordinate
- 01:04:39intersected with the diagonal.
- 01:04:41It takes more words to describe,
- 01:04:43so you can just simply
- 01:04:45see it from this picture.
- 01:04:47You draw this little tent function,
- 01:04:50and then tilt the diagram by 45°, and
- 01:04:53there you end up getting this function
- 01:04:57representation of your persistence diagram.
- 01:05:00You can now treat this as a function,
- 01:05:02and you can use tools from functional
- 01:05:06analysis to analyze this persistence diagram.
- 01:05:09Again,
- 01:05:09you can kind of formalize this with a
- 01:05:12bunch of math by drawing out what a tent
- 01:05:15function looks like and how you take
- 01:05:17a diagram and convert it into a function,
- 01:05:21but this is all just notation.
- 01:05:25I simply want to convey to you
- 01:05:27the intuition behind taking a
- 01:05:29diagram and converting it into a
- 01:05:32function for downstream analysis.
- 01:05:34There's some important reasons why one
- 01:05:36might want to convert this into a function.
- 01:05:39One of them is that you can use
- 01:05:41tools from functional analysis.
- 01:05:43Another thing that's important is
- 01:05:45that this is an injective mapping,
- 01:05:48and it satisfies the same properties
- 01:05:50that persistence diagrams satisfy.
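To make the tent-function construction concrete, here is a minimal sketch (my own illustration, not code from the talk; it assumes only `numpy` and a diagram given as (birth, death) pairs):

```python
import numpy as np

def landscape(diagram, grid, k=0):
    """k-th persistence landscape layer evaluated on a 1-D grid.

    diagram: list of (birth, death) pairs; k=0 is the outermost layer.
    Each point contributes a "tent" f(t) = max(0, min(t - birth, death - t));
    the landscape takes the k-th largest tent value at every grid position.
    """
    if k >= len(diagram):
        return np.zeros_like(grid)
    tents = np.array([
        np.maximum(0.0, np.minimum(grid - b, d - grid))
        for b, d in diagram
    ])
    # Sort tent values descending at each grid point and pick layer k.
    return -np.sort(-tents, axis=0)[k]

# Example: one long-lived feature and one short-lived feature.
grid = np.linspace(0, 4, 9)
lam0 = landscape([(0.0, 4.0), (1.0, 2.0)], grid, k=0)
```

Since each layer is just a function sampled on a grid, you can average landscapes across data sets or feed the sampled values straight into a classifier, which is exactly the point of the construction.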
- 01:05:54Another convenient way of converting
- 01:05:57persistence diagrams into something
- 01:05:59that's useful for machine learning is
- 01:06:02to convert the diagram into an image.
- 01:06:05The reason you might want to do this is
- 01:06:07because we have architectures that are very,
- 01:06:10very good at dealing with images.
- 01:06:12We know how to classify
- 01:06:14cats and dogs and horses.
- 01:06:16We also know how to generate images.
- 01:06:18So we can take advantage of all the tools
- 01:06:20we have developed for dealing with images
- 01:06:23in machine learning if we can convert
- 01:06:25persistence diagrams into an image.
- 01:06:27And the way one goes about doing that is
- 01:06:30you take your input persistence diagram,
- 01:06:33you tilt it again by 45°.
- 01:06:36So now you're measuring the birth
- 01:06:38coordinate and you're measuring
- 01:06:39distance from the diagonal, which we
- 01:06:41call persistence on the Y coordinate.
- 01:06:44So nothing fancy,
- 01:06:45just kind of tilting the diagram.
- 01:06:47Then what you do is at each point in
- 01:06:51the diagram, you drop a Gaussian,
- 01:06:54so like a 2D Gaussian,
- 01:06:56and you weigh the Gaussians by
- 01:06:58distance away from the X axis.
- 01:07:01So points that are higher
- 01:07:03up get a brighter Gaussian.
- 01:07:04The points that are lower down,
- 01:07:06they get a lower amplitude Gaussian.
- 01:07:09Again,
- 01:07:09the rationale for doing that is that
- 01:07:11points that are further away are points
- 01:07:14that are more topologically significant.
- 01:07:16Points that are close to the diagonal are kind of
- 01:07:18derived from some noise in our data,
- 01:07:20and we want to be robust to noise.
- 01:07:22So it makes sense to weigh things
- 01:07:24by distance away from the diagonal.
- 01:07:26This is called a persistence image.
- 01:07:28This is still a continuous object.
- 01:07:30And So what you can do then is you
- 01:07:33can take this surface and you can
- 01:07:35just divide it into smaller pixels
- 01:07:38and convert this into an image format.
- 01:07:41And once you have this in an image format,
- 01:07:43you can use convolutional neural
- 01:07:45networks and other kinds of
- 01:07:48generative AI tools, taking advantage
- 01:07:50of all of those tools to
- 01:07:53work with these persistence images.
- 01:07:56Again,
- 01:07:56there's a bunch of math that one
- 01:07:58can write down to kind of formally
- 01:08:00describe this process,
- 01:08:01but I think the visuals do a much better job.
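Here is a small sketch of that pipeline (my own illustration, assuming only `numpy`; the grid extent, resolution, and `sigma` are arbitrary choices):

```python
import numpy as np

def persistence_image(diagram, res=20, sigma=0.1, extent=(0, 1, 0, 1)):
    """Rasterise a persistence diagram into a res x res image.

    diagram: (birth, death) pairs. Each point is mapped to
    (birth, persistence) coordinates, i.e. the 45-degree tilt, then a
    2-D Gaussian of width sigma is dropped on it, weighted by
    persistence so long-lived features are brighter and
    near-diagonal noise is suppressed.
    """
    x0, x1, y0, y1 = extent
    X, Y = np.meshgrid(np.linspace(x0, x1, res), np.linspace(y0, y1, res))
    img = np.zeros((res, res))
    for b, d in diagram:
        pers = d - b  # distance from the diagonal
        img += pers * np.exp(-((X - b) ** 2 + (Y - pers) ** 2) / (2 * sigma ** 2))
    return img

# One significant feature (persistence 0.8) and one noisy one (0.05).
img = persistence_image([(0.1, 0.9), (0.4, 0.45)])
```

Libraries such as `persim` provide a more careful, tunable version of this transform; the sketch above just shows why the long-lived feature dominates the resulting picture.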
- 01:08:06Finally, you can convert your
- 01:08:08persistence diagrams into what are
- 01:08:11called smooth persistence curves.
- 01:08:13And the way this works is
- 01:08:16you walk along the diagonal,
- 01:08:18and as you're walking along the diagonal
- 01:08:21from the bottom left to the top right,
- 01:08:23you look at a window.
- 01:08:25And you construct the
- 01:08:26window by looking at this,
- 01:08:28like this rectangular section
- 01:08:30that's to the top left of
- 01:08:33wherever you are on the diagonal.
- 01:08:36And what you do is you compute some
- 01:08:38kind of statistic of points that
- 01:08:41exist within this little window.
- 01:08:42So one simple statistic would be
- 01:08:45simply counting the number of points
- 01:08:47that exist within this window.
- 01:08:49So that allows you to construct a
- 01:08:52function as you're walking from
- 01:08:55left to right, construct a curve,
- 01:08:58a continuous curve,
- 01:08:59which you can then analyze.
- 01:09:03I don't have the curve here for some reason,
- 01:09:06but you can imagine like as
- 01:09:07you're walking along the diagonal,
- 01:09:08just counting how many objects
- 01:09:10exist within this window over time
- 01:09:12gives you a continuous curve,
- 01:09:14which you can describe mathematically,
- 01:09:15and you can prove things about that curve.
- 01:09:17And one of the reasons why you might
- 01:09:19want to use this curve is that there
- 01:09:21are ways to speed this process up a lot
- 01:09:24because these are all like Gaussians.
- 01:09:26And so there's ways to compute
- 01:09:28this curve very, very fast.
- 01:09:30So that's also helpful for
- 01:09:33machine learning purposes.
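As a toy version of the counting statistic described here (my own sketch; real smooth persistence curve implementations replace the raw count with kernel-smoothed statistics, which is part of what makes them fast):

```python
def betti_curve(diagram, grid):
    """Persistence curve built from the simplest window statistic:
    at each filtration value t, count the features alive at t
    (birth <= t < death), i.e. the points in the rectangular window
    to the upper left of the position (t, t) on the diagonal."""
    return [sum(1 for b, d in diagram if b <= t < d) for t in grid]

# Two features: one alive on [0, 2), the other on [1, 3).
curve = betti_curve([(0.0, 2.0), (1.0, 3.0)], [0.5, 1.5, 2.5])
```

The result is a fixed-length vector no matter how many points the diagram has, which is exactly what downstream machine learning needs.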
- 01:09:35OK, Finally I just wanted to give you
- 01:09:38some sense of how you do topology for
- 01:09:42data that is not point cloud based data.
- 01:09:45What if you have like images
- 01:09:47that you want to work with?
- 01:09:50You can compute topology
- 01:09:52directly from images.
- 01:09:54So think of an image as nothing
- 01:09:56but a matrix of values, right?
- 01:09:58And the values are going to depict
- 01:10:00how bright a given pixel is.
- 01:10:02So if I have a one here,
- 01:10:03that pixel is quite dark,
- 01:10:05and a five is brighter.
- 01:10:07Sorry,
- 01:10:08the color here
- 01:10:10doesn't really correspond to the value,
- 01:10:12but five would be a brighter pixel.
- 01:10:14Three would be slightly dimmer than five,
- 01:10:16but brighter than one.
- 01:10:17So you can think of an image as nothing
- 01:10:20but a matrix of image intensity values.
- 01:10:22And you can perform the same kind of
- 01:10:26filtration that we did previously by
- 01:10:28expanding that epsilon radius disk by
- 01:10:31going through this matrix of values
- 01:10:34and simply deleting everything that's
- 01:10:37above a value or below a value.
- 01:10:39And these are called sublevel set
- 01:10:41or superlevel set filtrations,
- 01:10:43where in this example,
- 01:10:45only values that are
- 01:10:48less than or equal to one are shown,
- 01:10:50and we spot two holes in our data.
- 01:10:52So five and three become holes.
- 01:10:55Then you increase your threshold to three.
- 01:10:57So now three gets filled in,
- 01:10:59but there's only one hole in the data set.
- 01:11:01And then as you increase your
- 01:11:03threshold to five, both
- 01:11:05of those holes get filled in.
- 01:11:07So when you're working with images,
- 01:11:09you can construct what are
- 01:11:12called cubical complexes,
- 01:11:13where you define a threshold
- 01:11:15value for your image,
- 01:11:16and by applying that threshold you
- 01:11:19can count in your pixels how many
- 01:11:22holes exist and you can quantify the
- 01:11:25shape of an image in that manner.
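To make the sub-level set idea concrete, here is a small sketch (my own code, not from the talk) that thresholds a toy intensity matrix like the one on the slide and counts the holes in the resulting mask with a flood fill; real cubical-complex software, such as GUDHI's `CubicalComplex`, tracks births and deaths across all thresholds at once rather than one threshold at a time:

```python
import numpy as np

def count_holes(mask):
    """Count holes in a boolean mask: connected regions of background
    pixels (4-connectivity) that do not touch the image border."""
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    seen = np.zeros_like(mask)
    holes = 0
    for i in range(h):
        for j in range(w):
            if mask[i, j] or seen[i, j]:
                continue
            # Flood-fill one background component.
            stack, touches_border = [(i, j)], False
            seen[i, j] = True
            while stack:
                y, x = stack.pop()
                if y in (0, h - 1) or x in (0, w - 1):
                    touches_border = True
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and not mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        stack.append((ny, nx))
            holes += not touches_border
    return holes

# Toy image like the one on the slide: a field of 1s with a
# bright 5 and a 3 inside.
img = np.array([
    [1, 1, 1, 1, 1],
    [1, 5, 1, 3, 1],
    [1, 1, 1, 1, 1],
])
```

Sub-level masks `img <= 1`, `img <= 3`, `img <= 5` then give two holes, one hole, and zero holes, matching the filtration described above.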
- 01:11:27I'll show you how this can be used in a
- 01:11:30very powerful way in a subsequent workshop.
- 01:11:33I just want to point to the paper here.
- 01:11:37So there's a nice paper that came out
- 01:11:39in 2020 by Bastian Rieck and
- 01:11:42other folks in the topology community
- 01:11:45where they took fMRI images,
- 01:11:47which were volumetric fMRI images.
- 01:11:50So there's lots and lots of data.
- 01:11:52And they performed this cubical
- 01:11:54complex filtration of these
- 01:11:56volumetric images to construct a
- 01:11:59sequence of persistence diagrams,
- 01:12:01which they converted into persistence images.
- 01:12:04And from these persistence images,
- 01:12:06they were able to use machine
- 01:12:08learning techniques to
- 01:12:09categorize different
- 01:12:10brain state trajectories.
- 01:12:12So this is a very cool paper that
- 01:12:14combines a lot of the stuff that we
- 01:12:17talked about of going from images to
- 01:12:20persistence diagrams to persistence
- 01:12:21images and those images being used
- 01:12:24as input for machine learning to
- 01:12:26classify brain state trajectories.
- 01:12:29You can also take directly
- 01:12:31the persistence diagram,
- 01:12:32compute summaries of the persistence
- 01:12:34diagram such as persistence
- 01:12:36landscapes and persistence curves.
- 01:12:38This is a persistence curve here,
- 01:12:40and you can use those persistence
- 01:12:42curves directly to perform regression
- 01:12:44tasks such as estimating the severity
- 01:12:46of the disease in these fMRI images.
- 01:12:51Lastly, something that's going to be
- 01:12:54highly relevant to us going forward
- 01:12:56is doing TDA on time series data.
- 01:13:00So here's an example of
- 01:13:02a time series data set.
- 01:13:04These are just two sinusoidal
- 01:13:06curves, F1 and F2.
- 01:13:08You can see F1 has a higher amplitude,
- 01:13:11F2 has a smaller amplitude over time.
- 01:13:15And So what you can do is you
- 01:13:16can plot them against each other.
- 01:13:18So you can plot F1 against F2.
- 01:13:21And this is one way of
- 01:13:23taking time series data
- 01:13:24(you have to discretize in time, of course)
- 01:13:26and converting it into
- 01:13:28a point cloud data set.
- 01:13:30And you can compute topology directly
- 01:13:32from this point cloud data set.
- 01:13:34So this works when you have two
- 01:13:36time series data sets, F1 and F2.
- 01:13:38You can convert that into a point cloud.
- 01:13:41If you have just one time series data set,
- 01:13:44what you can do is you can do a
- 01:13:46sliding window transformation.
- 01:13:48So you take a small sliding window,
- 01:13:50so a small chunk of the data,
- 01:13:52move that window forward 1 by 1 by 1.
- 01:13:55And for within that window,
- 01:13:57you can construct this phase portrait or
- 01:14:00this time delay embedding as it's called.
- 01:14:02And then you can take that loop and you
- 01:14:05can convert that into a persistence diagram.
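A minimal sketch of the sliding-window (time-delay) embedding just described, assuming only `numpy`; the signal, window dimension, and delay here are arbitrary illustrative choices:

```python
import numpy as np

def delay_embedding(x, dim=2, tau=1):
    """Time-delay embedding: row i is (x[i], x[i+tau], ..., x[i+(dim-1)*tau]).
    A single time series becomes a point cloud; a periodic signal traces
    out a loop, which persistent homology then detects as an H1 feature."""
    x = np.asarray(x)
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i * tau : i * tau + n] for i in range(dim)], axis=1)

# A sine wave embedded with a delay near a quarter period traces a circle:
# (sin t, sin(t + pi/2)) is approximately (sin t, cos t).
t = np.linspace(0, 4 * np.pi, 200)
cloud = delay_embedding(np.sin(t), dim=2, tau=25)
```

An H1 feature far from the diagonal in the persistence diagram of this point cloud is exactly the signature of periodicity discussed above.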
- 01:14:08So I wanted to show you some
- 01:14:11examples of kind of doing this,
- 01:14:13right.
- 01:14:14So this is combining both cubical
- 01:14:17homology and this time delay embedding,
- 01:14:20the sliding window embedding, and using
- 01:14:23that to compute persistence diagrams.
- 01:14:27And so again, I'm not going to
- 01:14:28go through all the details here,
- 01:14:30but I just wanted to show you an
- 01:14:32application of this in practice.
- 01:14:34So this is a paper which was
- 01:14:36on arXiv in 2018.
- 01:14:37It might be out already.
- 01:14:39I hope it's out by now.
- 01:14:40And in this data set,
- 01:14:42they were imaging
- 01:14:44the vocal cords
- 01:14:46of humans as they were
- 01:14:49making some sounds.
- 01:14:52And so when you're making like
- 01:14:54a rhythmic pattern of sounds,
- 01:14:56your vocal cords, they open, they close,
- 01:14:58they open and then they close.
- 01:15:00And so in this data set,
- 01:15:03obviously there's a periodic nature to
- 01:15:04the type of sound you're producing.
- 01:15:07And so if you look at these
- 01:15:09images of the vocal cords and you
- 01:15:11compute image self similarity,
- 01:15:12you can kind of guess that there
- 01:15:14is a period after which the
- 01:15:17image becomes similar to itself.
- 01:15:18And so you can kind of quantify the
- 01:15:21periodicity of the data in this way.
- 01:15:23But also,
- 01:15:23if you take the sequence of images
- 01:15:26and you do this time delay embedding
- 01:15:29and compute cubical homology,
- 01:15:30you can end up with a persistence
- 01:15:33diagram where very clearly you
- 01:15:35see this H1 feature which tells
- 01:15:37you there's a loop in your data,
- 01:15:39which means that the data is periodic.
- 01:15:42What's cooler is when you do
- 01:15:45what's called biphonation.
- 01:15:46So in biphonation,
- 01:15:47the vocal cords move in a way
- 01:15:50that they produce two different
- 01:15:52frequencies at the same time.
- 01:15:54So you have like a high frequency,
- 01:15:57a sound coming out,
- 01:15:58and at the same time you have like
- 01:16:01a low frequency whine coming out.
- 01:16:03And I was thinking of doing a
- 01:16:05demonstration of this kind of voice,
- 01:16:06but I really cannot do it.
- 01:16:08So you'll have to look it up.
- 01:16:09If you search online for biphonation,
- 01:16:11you will find examples of people who
- 01:16:13can produce both high frequencies
- 01:16:15and low frequencies at the same time.
- 01:16:17And if you look at the vocal cord
- 01:16:20images of producing this kind of sound,
- 01:16:22this is kind of what it looks like
- 01:16:24when you look at self similarity.
- 01:16:25You do observe a pattern here.
- 01:16:28But very importantly,
- 01:16:29when you take this data and you plug it
- 01:16:32through the techniques that I've described,
- 01:16:34you get a persistence
- 01:16:36diagram that looks like this,
- 01:16:38which has two H1 features and one H2 feature.
- 01:16:42So remember,
- 01:16:44H1 is the dimension 1 hole and
- 01:16:48H2 is the dimension 2 hole, or a void.
- 01:16:51And so from our previous quiz,
- 01:16:53hopefully you recall that two H1 holes and
- 01:16:56one H2 hole means it's like a torus,
- 01:16:59which means there's an empty space in the
- 01:17:02middle and there are two loops in the torus.
- 01:17:04And that makes perfect sense for
- 01:17:06this data set because you have a
- 01:17:09high frequency and a low frequency
- 01:17:10forming 2 loops here.
- 01:17:12And then, because it's
- 01:17:13arranged like a torus,
- 01:17:14both of those things are happening
- 01:17:16at the same time,
- 01:17:17you get a dimension 2 hole in the
- 01:17:19data set or, as it says here,
- 01:17:21a two-cycle in the data set.
- 01:17:25And this again is from the same paper.
- 01:17:27This is an example where the person is
- 01:17:31showing irregular vocal fold vibrations.
- 01:17:34So there is no periodicity,
- 01:17:36no quasi periodicity.
- 01:17:38It appears random.
- 01:17:40When you look at image self similarity,
- 01:17:42it just goes along the diagonal.
- 01:17:44You don't see a lot of like important
- 01:17:47self similarity off the diagonal,
- 01:17:49which means that all of these images
- 01:17:51look kind of different from each other.
- 01:17:53And if you throw this into TDA,
- 01:17:55you get topological features that
- 01:17:57are very close to the diagonal.
- 01:18:00Again,
- 01:18:01you can compute like confidence
- 01:18:03intervals and so forth for these things.
- 01:18:06But again,
- 01:18:07it shows there's no interesting
- 01:18:09topology happening in this data
- 01:18:11set because it's irregular.
- 01:18:12There's no quasi periodicity
- 01:18:14or periodicity in this data.
- 01:18:19Lastly, topology is invertible
- 01:18:22to a certain extent.
- 01:18:23So folks often ask like, OK,
- 01:18:26I created this persistence diagram.
- 01:18:28I want to interpret where these
- 01:18:31topological features come from.
- 01:18:33And you can do that using something
- 01:18:36called cycle representatives.
- 01:18:37And what cycle representatives
- 01:18:39allow you to do is they allow you to
- 01:18:42interrogate a specific topological
- 01:18:44feature and ask the question,
- 01:18:46where does that feature come
- 01:18:47from in your input data set?
- 01:18:50So for example,
- 01:18:51if we have this persistence diagram that's
- 01:18:53derived from this point cloud data set,
- 01:18:55you can then interrogate this
- 01:18:57topological feature, and it
- 01:19:00will tell you that this dimension
- 01:19:020 feature appears here because
- 01:19:04these two clusters of data became
- 01:19:07connected at that epsilon value.
- 01:19:09So these two connected
- 01:19:11components disappeared,
- 01:19:12they merged together at that epsilon value.
- 01:19:16And likewise for dimension one,
- 01:19:18you can interrogate that topological
- 01:19:20feature and it will tell you
- 01:19:22that that particular loop is
- 01:19:24formed by these four points.
- 01:19:26This is very,
- 01:19:27very important for us because when we
- 01:19:29are dealing with the state space of
- 01:19:32cellular activity and neural activity,
- 01:19:34having access to these cycle
- 01:19:37representatives will give
- 01:19:39us the ability to
- 01:19:41say which time points and which
- 01:19:44parts of the brain precisely led
- 01:19:46to the formation of a cycle which
- 01:19:49indicates periodic activity.
- 01:19:51So we can indeed go back in reverse
- 01:19:54from topological features back to
- 01:19:57our original data set and figure
- 01:19:59out why a certain topological
- 01:20:02feature exists in our data.
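For dimension 0, this inverse map is easy to sketch by hand: single-linkage merging with a union-find records exactly which pair of points joined two components at which epsilon. (This is my own toy illustration; dedicated TDA libraries expose proper representative cycles for the higher dimensions as well.)

```python
import numpy as np
from itertools import combinations

def h0_merge_events(points):
    """Dimension-0 "cycle representatives": single-linkage merge events.
    Returns (epsilon, i, j) triples recording which pair of points first
    joined two previously separate connected components, i.e. where each
    H0 feature dies and which data points are responsible."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    pairs = sorted(
        combinations(range(len(points)), 2),
        key=lambda p: np.linalg.norm(points[p[0]] - points[p[1]]),
    )
    events = []
    for i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:  # this edge merges two components
            parent[ri] = rj
            events.append((float(np.linalg.norm(points[i] - points[j])), i, j))
    return events

# Two well-separated clusters of two points each: the last merge event
# names one point from each cluster and the epsilon at which they joined.
pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
events = h0_merge_events(pts)
```

The final event says which two data points connected the two clusters, and at what epsilon, which is precisely the "where does this H0 feature come from" question answered above.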
- 01:20:04And so I think we are at a stage here
- 01:20:07where we don't have a lot of time left.
- 01:20:10We're supposed to end at 5:30.
- 01:20:12So I'm not going to go through the ML parts.
- 01:20:14I had a few ML slides,
- 01:20:16but I think we can punt that to
- 01:20:18our third workshop.
- 01:20:19I think it would be a good time
- 01:20:21to take questions and end here.
- 01:20:23I just wanted to mention if you're,
- 01:20:24if you're about to leave,
- 01:20:25we're going to have another workshop
- 01:20:28next week that will be given by
- 01:20:30Rahul and he'll be telling you how
- 01:20:33we can take a graph consisting of
- 01:20:35nodes and edges and use graph signal
- 01:20:38processing to quantify how some
- 01:20:40signal is distributed on that graph.
- 01:20:43And then the following week,
- 01:20:45me and Brian,
- 01:20:46we're going to put all of
- 01:20:48these things together,
- 01:20:49talk about GSTH as a technique
- 01:20:52all combined together and we'll
- 01:20:54show you how we have used this
- 01:20:56technique with some of our data sets.
- 01:20:59Thanks for listening.
- 01:21:00Thanks for coming.
- 01:21:01Jay,
- 01:21:02I have a question.
- 01:21:03It may be a little bit premature,
- 01:21:05but is my intuition correct?
- 01:21:07Do I understand correctly that
- 01:21:09if you have more organized,
- 01:21:11more complexly organized systems,
- 01:21:13you should see higher-level holes?
- 01:21:21And if it's mostly random noise,
- 01:21:24you kind of don't really see much? Yeah.
- 01:21:27So if you have random noise, then in space,
- 01:21:30everything will get filled in, right?
- 01:21:33It's all just noise.
- 01:21:34So there'll be no structure to the data set
- 01:21:37and you won't see any holes in the data.
- 01:21:40The connectivity pattern
- 01:21:41also looks different.
- 01:21:42So you can do statistical tests where you
- 01:21:45can take real data from an experiment and
- 01:21:49compare that with topological features
- 01:21:51derived from like standard distributions,
- 01:21:54like uniform distribution
- 01:21:55and Gaussian distribution,
- 01:21:56and it will tell you that in your
- 01:21:58experiment in the state space,
- 01:22:00it kind of looks like a uniform distribution.
- 01:22:02There's no structure to it. Yeah.
- 01:22:04So that's that's one aspect.
- 01:22:06Also like dimension zero will tell you
- 01:22:08if in your trajectory here you have two
- 01:22:11different connected components, right.
- 01:22:13So if you have like one set
- 01:22:15of states and then a completely
- 01:22:17different set of states and they're
- 01:22:19kind of far apart from each other.
- 01:22:21That's what we learned from like
- 01:22:23dimension 0 homology in addition to like,
- 01:22:25you know,
- 01:22:26noise and how the data is distributed.
- 01:22:28And dimension one will tell us these
- 01:22:30periodic loop like structures that
- 01:22:32might exist in our data and also the
- 01:22:34empty spaces being states that cannot
- 01:22:37really exist based off of this
- 01:22:39experimental data, again being
- 01:22:41cognizant of the earlier question where
- 01:22:43if your data is not sampled correctly,
- 01:22:45it might be telling you the wrong thing.
- 01:22:49OK, thank you. Do you have any other
- 01:22:54questions in the chat or if anybody
- 01:22:56wants to ask any follow up questions?
- 01:23:01All right, I don't see any questions.
- 01:23:03Thank you so much.
- 01:23:05Just a note, all these papers that
- 01:23:08you mentioned in the presentation
- 01:23:09I'm going to add to your website.
- 01:23:12So if people are interested
- 01:23:14in looking at those papers,
- 01:23:16there will be links on your maps site.
- 01:23:21And thank you again.
- 01:23:22And I'll see you next week. See
- 01:23:24you next week. See you. Bye.