
MAPS_GSTH_part_I

June 18, 2024
  • 00:00It is my pleasure to introduce
  • 00:03Dhananjay Bhaskar, Jay.
  • 00:06He is a postdoctoral researcher
  • 00:08in the Department of Genetics
  • 00:10at Yale School of Medicine.
  • 00:12He has a strong quantitative
  • 00:15background in mathematical
  • 00:18modeling, machine learning,
  • 00:20and topological data analysis,
  • 00:22with applications in biophysics
  • 00:24and biomedical research.
  • 00:26He received his PhD in Biomedical
  • 00:28Engineering and his Master's degree
  • 00:31in Data Science from Brown University,
  • 00:34and before that he studied computer
  • 00:36science and applied mathematics
  • 00:38at the University of British Columbia.
  • 00:40This is going to be a four-part series and
  • 00:44we have a group of presenters who I will
  • 00:48introduce for each section separately.
  • 00:52And Jay, I'm going to give it to you.
  • 00:54So if you want to add anything about
  • 00:56overall research, you're welcome to.
  • 00:58And if anybody
  • 01:00has questions,
  • 01:01you're welcome to type them in the
  • 01:04chat or in question and answers.
  • 01:06And Jay said that he will respond
  • 01:08to them as they come.
  • 01:10And thank you, Helen,
  • 01:12for the very kind introduction.
  • 01:14And welcome, everyone,
  • 01:16to the first workshop in this series.
  • 01:19Today, my goal is to introduce you
  • 01:23to a methodology called topological
  • 01:26data analysis and machine learning.
  • 01:29Both of these techniques are,
  • 01:32you know, very broad.
  • 01:33They encompass a number of different
  • 01:36methods and so there's no way I can
  • 01:38cover both of them in any amount of
  • 01:41detail in just a single session.
  • 01:43My goal is to give you a broad
  • 01:46overview of these techniques and
  • 01:48to build some intuition and for
  • 01:52us to share a common vocabulary.
  • 01:55And then in the subsequent workshops,
  • 01:57we are going to take
  • 01:59bits and pieces of topological
  • 02:02data analysis, TDA in short,
  • 02:04and machine learning that we need
  • 02:08to analyze some neuroimaging data.
  • 02:11And as Helen mentioned, my name is Dhananjay.
  • 02:14I generally go by Jay.
  • 02:16I wear many different hats,
  • 02:18but most relevant is I'm a
  • 02:22postdoctoral fellow in neuroscience.
  • 02:24I also work, I wanted to disclose,
  • 02:28with Boehringer Ingelheim,
  • 02:29which is a German pharmaceutical company,
  • 02:32and I still maintain some
  • 02:34affiliations with Brown,
  • 02:35having completed my PhD there.
  • 02:38So first, I wanted to
  • 02:41give you a little bit of
  • 02:43an introduction to myself.
  • 02:43So as Helen mentioned,
  • 02:45I come from a very quantitative
  • 02:48dry lab background.
  • 02:50I received my undergraduate degree in
  • 02:52computer science and math and did a
  • 02:55master's degree in applied mathematics.
  • 02:57And in those years, a long,
  • 03:00long time ago,
  • 03:01I was interested in modelling
  • 03:04biophysics and in particular,
  • 03:06I was interested in developmental
  • 03:08and cancer biology.
  • 03:09So I spent a lot of my formative
  • 03:12years putting together agent based
  • 03:15models to simulate cell migration
  • 03:18and cell morphology and emergence of
  • 03:21different types of migratory patterns
  • 03:24in normal healthy tissue and also
  • 03:28various kinds of tumors and you know,
  • 03:32across development and embryogenesis.
  • 03:35Subsequently,
  • 03:36I moved over to Brown University
  • 03:40where I was in the data science and
  • 03:44biomedical engineering departments.
  • 03:45And it was at Brown University
  • 03:48where I became fascinated with some
  • 03:51mathematical concepts to do with shape.
  • 03:54And I realized during these years
  • 03:58that learning the shape of the data
  • 04:01and being able to quantify the shape
  • 04:04of the data can be a really powerful
  • 04:08tool for biomedical data analysis.
  • 04:11And so for instance,
  • 04:12if you have a bunch of data
  • 04:15points that are arranged in this
  • 04:17weird double torus-like shape,
  • 04:19being able to actually take these
  • 04:22individual data points and be
  • 04:24able to fill in the empty spaces.
  • 04:27And to be able to recognize that
  • 04:29there are two big holes in the data
  • 04:31and that our data has a loop like
  • 04:34structure where there's a bigger loop
  • 04:36around which our data points are organized.
  • 04:39And then there are smaller loops
  • 04:41that surround those bigger loops.
  • 04:43Being able to recognize these types of
  • 04:46patterns can be extremely powerful,
  • 04:48especially when we are dealing
  • 04:50with biological data.
  • 04:51And then recently I moved to a postdoc
  • 04:56in genetics and also, I guess,
  • 04:59in computer science
  • 05:00at Yale University.
  • 05:02And I spent my postdoc thinking a lot
  • 05:05about data that is structured like a graph.
  • 05:08And so what I mean by a graph
  • 05:11here is a set of nodes and edges.
  • 05:14So the nodes being represented as the
  • 05:16circles and edges being the lines that
  • 05:19are connecting different nodes together.
  • 05:21There's lots and lots of data out there
  • 05:24that can be represented in this format.
  • 05:27For example, if you are looking
  • 05:29to do drug discovery,
  • 05:31you can represent molecules using
  • 05:34nodes and edges corresponding
  • 05:37to atoms and bonds, respectively.
  • 05:40You can take protein sequences,
  • 05:41fold them with AlphaFold and then
  • 05:44represent protein structure in this manner.
  • 05:46But also you can take neuroscience
  • 05:50data such as brain imaging data and
  • 05:54divide the brain into different parcels
  • 05:56and learn to represent brain imaging
  • 05:59data in this format where the nodes
  • 06:02are going to represent different
  • 06:03parcels or regions of the brain.
  • 06:06And the edges could be anatomical
  • 06:09connectivity between those
  • 06:10parcels in the brain,
  • 06:12or they could be functional connectivity
  • 06:14between those parcels in the brain.
  • 06:16Maybe at an even higher level,
  • 06:18one can think of like taking biomedical
  • 06:21data in general and representing it
  • 06:23in the format of a knowledge graph,
  • 06:26where you can bring in data from,
  • 06:28you know,
  • 06:30publications from single cell sequencing
  • 06:34experiments and other modalities
  • 06:35and represent all of that data in
  • 06:38a large biomedical knowledge graph.
  • 06:40So during my postdoc,
  • 06:42I've developed techniques to not
  • 06:44only represent data as graphs,
  • 06:46but also to develop machine learning
  • 06:49techniques to learn to reason about
  • 06:52these kinds of graphs and represent
  • 06:54them in a way that a computer can
  • 06:56understand the structure of the
  • 06:58graph and take advantage of that
  • 07:00to answer all kinds of questions.
  • 07:02And so today I'm going to talk
  • 07:05to you about a technique that is
  • 07:09utilizing this graph structure and
  • 07:11combining it with aspects of topology,
  • 07:15which is essentially the technique that
  • 07:17allows us to recognize the shape of our data.
  • 07:20And so to motivate this,
  • 07:23this technique that I'm going to
  • 07:24talk about and we're going to
  • 07:26develop over the next few workshops,
  • 07:28I wanted to share with you some
  • 07:30time lapse microscopy images that
  • 07:33were taken a long time ago.
  • 07:36And these are calcium imaging data
  • 07:39sets of a developing zebrafish embryo.
  • 07:43And if I could play these,
  • 07:46not sure if I can hang on a second,
  • 07:52OK, if I play these images,
  • 07:55What you'll notice here is that on the left,
  • 07:58you have a zebrafish embryo
  • 08:00that's early in its development.
  • 08:02In the middle, it's grown a little bit more.
  • 08:05And on the right,
  • 08:06the zebrafish embryo is much
  • 08:07further along in its development.
  • 08:09What you're going to notice is
  • 08:11that the signalling patterns,
  • 08:12the calcium signalling patterns across
  • 08:15development look very different.
  • 08:17In the video on the left,
  • 08:19early in development we see that
  • 08:22we see individual spiking events,
  • 08:24so individual calcium signaling
  • 08:26events that are not really
  • 08:28correlated temporally or spatially.
  • 08:31A little bit further along in development,
  • 08:34we start to see patches of
  • 08:36synchronous activity in the embryo.
  • 08:39So you see these small patches,
  • 08:40but they don't really travel very far.
  • 08:43And even later in development you
  • 08:45start to see these waves traveling
  • 08:48wave like patterns where you have
  • 08:50calcium signaling starting at a
  • 08:52small group of cells and that
  • 08:54really kind of expands and goes all
  • 08:56across the embryo of the zebrafish.
  • 08:58And even today,
  • 09:00although we have really nice
  • 09:02techniques for being able to capture
  • 09:05this kind of imaging data,
  • 09:07we don't really have good quantitative
  • 09:11tools to be able to analyze the
  • 09:15spatiotemporal patterns that
  • 09:17we see in these videos.
  • 09:19Likewise for brain imaging,
  • 09:21we have really well developed
  • 09:23tools: fMRI, NIRS, EEG,
  • 09:26all kinds of tools to image the brain.
  • 09:33And we see here, for example, in
  • 09:37this example, that the brain activity
  • 09:39patterns that we get in a healthy
  • 09:42typically developed human and an
  • 09:44individual who's suffering from Alzheimer's,
  • 09:47they're very different.
  • 09:48We don't really have a good
  • 09:52tool set to be able to analyze
  • 09:54the spatiotemporal dynamics that
  • 09:57we are seeing across the brain.
  • 09:59So this problem of quantifying
  • 10:03dynamics both spatially and temporally,
  • 10:06this exists not just at a cellular
  • 10:09and tissue scale in biology,
  • 10:12but also at a systems and organ
  • 10:15scale in neuroscience.
  • 10:16And this is something that
  • 10:18we wish to address.
  • 10:20And so what are some of the challenges
  • 10:23in these data sets and how do we go
  • 10:26from these noisy high dimensional
  • 10:29neuroimaging data sets to neural insights?
  • 10:32And what I mean by neural insights
  • 10:35here is really figuring out
  • 10:37patterns of activity both spatially
  • 10:40and temporally in the brain that
  • 10:43correspond to various kinds of stimuli,
  • 10:46various kinds of diseases,
  • 10:49and various kinds of tasks.
  • 10:52And so ideally what we want
  • 10:54to be able to do
  • 10:55is to build a network that can take
  • 10:59in patterns of brain activity and say
  • 11:02that this pattern of brain activity
  • 11:05corresponds to somebody who's maybe
  • 11:08clicking their right thumb like this.
  • 11:12And so the challenge
  • 11:14is enormous because when we look
  • 11:17at this kind of data,
  • 11:19and this is again a brain imaging
  • 11:21data set here,
  • 11:22if you visualize the data,
  • 11:23we see that there is a lot of noise
  • 11:27in this data set.
  • 11:29If we take just one voxel of
  • 11:31this brain imaging
  • 11:32data set and we visualize it over time,
  • 11:35we see that we don't really see this nice
  • 11:38clean line that we would like to see.
  • 11:40In fact, we see that the
  • 11:41data is all over the place.
  • 11:43So we have to learn to be
  • 11:45able to denoise this data set.
  • 11:49The second thing we want
  • 11:51to do is we want to learn
  • 11:54salient features of the data set.
  • 11:56So in these neuroimaging data sets and also
  • 11:59in calcium imaging and other data sets,
  • 12:02not all features of the
  • 12:05image are equally important.
  • 12:07There are some features of the image
  • 12:09that are salient to the task at hand,
  • 12:11whether it's to diagnose individuals
  • 12:13or to learn what kind of stimulus
  • 12:17they're experiencing or to learn
  • 12:20how to decode their brain activity into
  • 12:23whatever stimulus that they experienced.
  • 12:26So distilling the state space of the
  • 12:29brain and learning salient features
  • 12:31of this data set is very important.
  • 12:35And finally, in neuroimaging in particular,
  • 12:39we are always challenged by spatial
  • 12:42versus temporal resolution.
  • 12:44So we have techniques such as
  • 12:47EEG which have very,
  • 12:49very good temporal resolution but
  • 12:52have very poor spatial resolution.
  • 12:55On the other hand,
  • 12:57we have techniques such as fMRI where
  • 12:59the spatial resolution is amazing.
  • 13:01We get thousands and thousands
  • 13:03of voxels across the brain,
  • 13:05but the temporal resolution of fMRI at
  • 13:09around 0.5 Hz is very low compared to EEG.
  • 13:13So we want to develop techniques
  • 13:16that can bridge the gap
  • 13:18between high spatial resolution
  • 13:20and high temporal resolution.
  • 13:22And we want to develop techniques
  • 13:25that can perhaps integrate multiple
  • 13:27modalities of data together so we can
  • 13:30benefit from both high spatial resolution
  • 13:33and also high temporal resolution.
  • 13:38So those were just some of the motivating
  • 13:41factors that in our lab led to the
  • 13:44development of a technique called GSTH.
  • 13:48GSTH stands for Geometric
  • 13:51Scattering Trajectory Homology.
  • 13:53And it's a bit of a mouthful.
  • 13:55And over the next two or three workshops,
  • 13:59we are going to go into all the
  • 14:02components that form this methodology.
  • 14:05And so today, just to begin with,
  • 14:07I'll just give you a very short
  • 14:10introduction to
  • 14:11how this methodology works.
  • 14:14And so in this method,
  • 14:16we start by creating a graph from our data.
  • 14:20If we are dealing with
  • 14:22some calcium imaging data,
  • 14:24like imagine you are imaging
  • 14:27calcium from the primary visual
  • 14:29cortex of a mouse and you're
  • 14:31maybe imaging in like layer 4.
  • 14:33Let's say you're going to
  • 14:36get a sequence of images.
  • 14:38And what you can do is you can use
  • 14:41existing tools to segment those images
  • 14:44so you know where the cells are located.
  • 14:47And then you can build a graph.
  • 14:49And by graph again,
  • 14:51I mean nodes and edges by using the
  • 14:54centroids of all the cells as nodes in
  • 14:57the graph and putting an edge between
  • 15:00any pair of cells that share a boundary.
  • 15:03So any two cells that are adjacent
  • 15:05to each other will be two nodes
  • 15:08connected by an edge in the graph.
  • 15:10Similarly,
  • 15:11if you have some neuroimaging data set
  • 15:14that you're looking to analyze with GSTH,
  • 15:18what you can do is you can take the
  • 15:21brain and you can convert it into
  • 15:24parcels using your favorite Atlas.
  • 15:26And so those individual parcels of the
  • 15:29brain will form the nodes in the graph.
  • 15:33And we are going to put an edge between
  • 15:36any pair of parcels that are anatomically
  • 15:39close to each other in the brain.
  • 15:42So we start with a graph construction.
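As a rough sketch of this graph-construction step (a hedged illustration, not the GSTH implementation: the proximity-threshold rule, function name, and parameter values below are all stand-ins for the adjacency criteria described above):

```python
import numpy as np
import networkx as nx
from scipy.spatial.distance import pdist, squareform

def build_proximity_graph(centroids, threshold):
    """Nodes are cell or parcel centroids; edges connect pairs closer than `threshold`."""
    dist = squareform(pdist(centroids))  # pairwise Euclidean distances
    graph = nx.Graph()
    graph.add_nodes_from(range(len(centroids)))
    # connect every pair of centroids within the distance threshold
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            if dist[i, j] < threshold:
                graph.add_edge(i, j)
    return graph

rng = np.random.default_rng(0)
centroids = rng.uniform(0, 100, size=(100, 2))  # placeholder cell centroids
G = build_proximity_graph(centroids, threshold=15.0)
```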
  • 15:45Now each node in the graph will have
  • 15:48a signal assigned to it and the signal
  • 15:51is going to be a time varying signal.
  • 15:54In the case of calcium imaging,
  • 15:57for example,
  • 15:57we are going to have as our signal
  • 16:01the calcium activity over time.
  • 16:04In the case of
  • 16:06neuroimaging data sets,
  • 16:07we are going to have averaged voxel
  • 16:11activations within each parcel as
  • 16:14our time lapse signal on the graph.
  • 16:18And so in GSTH,
  • 16:20what we do is we take that graph
  • 16:22and we use some techniques in graph
  • 16:25signal processing to convert the time
  • 16:28lapse signal on the graph into some
  • 16:31kind of numerical representation.
  • 16:33So think of like taking this
  • 16:35time lapse signal on the graph
  • 16:37and coming up with a vector,
  • 16:39which is nothing but a sequence
  • 16:42of numbers that represent how that
  • 16:45signal is distributed in the graph.
  • 16:49And we're going to cover how this
  • 16:52graph signal processing happens
  • 16:53in the next workshop.
  • 16:55But assuming you can do that,
  • 16:57the next step in our methodology
  • 17:00is to construct a trajectory of
  • 17:05the dynamics using some nonlinear
  • 17:08dimensionality reduction techniques.
  • 17:10And again,
  • 17:10this is something that we will cover
  • 17:12in detail in subsequent workshops.
  • 17:14But what's happening here is that
  • 17:17we are representing the time
  • 17:19lapse data that we started with
  • 17:21through a low-dimensional trajectory.
  • 17:24So in this case, I'm showing you a 3D
  • 17:28trajectory and it's colored by time.
  • 17:30And so we are saying that we start
  • 17:32over here and we kind of move around,
  • 17:35we go in a circle and we end up
  • 17:38in this region of the space.
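As a hedged sketch of this trajectory step: one feature vector per time point goes into a nonlinear dimensionality reduction, and the embedded points, ordered by time, trace out the trajectory. Isomap is used below purely as a stand-in (the specific method used in GSTH is covered in a later workshop), and the array shapes are placeholders:

```python
import numpy as np
from sklearn.manifold import Isomap

rng = np.random.default_rng(0)
features = rng.random((200, 64))  # placeholder: 200 time points x 64 graph features

# embed each time point into 3D; embedding[t] is the state at time t, and
# plotting the rows in order, colored by t, gives a trajectory like the one shown
embedding = Isomap(n_components=3).fit_transform(features)
```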
  • 17:40And so recall how I talked to you
  • 17:43earlier about denoising and learning
  • 17:45the state space as being important
  • 17:49challenges in neuroscience.
  • 17:51Well, this graph signal processing
  • 17:54in Step 2 effectively denoises
  • 17:57the data set that we started with.
  • 18:00And these trajectories,
  • 18:02these low dimensional trajectories allow
  • 18:05us to quantify where in state space we are.
  • 18:10In particular,
  • 18:11what I want to emphasize is that
  • 18:14within these trajectories,
  • 18:16anytime you get a loop in the trajectory,
  • 18:20that means that your underlying
  • 18:22signaling pattern has some kind
  • 18:25of periodicity attached to it.
  • 18:28Because a loop structure in this
  • 18:31low-dimensional space indicates that we end
  • 18:33up at the same state or close to the
  • 18:37same state where we started from.
  • 18:39So these trajectories are really quite
  • 18:42informative and we can interpret the
  • 18:45shape of these trajectories by looking
  • 18:48at looking at our data through the
  • 18:50lens of periodicity and quasi periodicity.
  • 18:54So we recognize that the shape of
  • 18:56these trajectories is very important.
  • 18:58And in order to be able to compare
  • 19:01across different data sets and across
  • 19:03different subjects in an experiment,
  • 19:05we need to find a way of quantifying
  • 19:09the shape of the trajectory.
  • 19:11And to quantify the shape of the trajectory,
  • 19:14we use topological data
  • 19:17analysis as our main tool.
  • 19:19And so topological data analysis is a
  • 19:21technique that I'm going to cover today,
  • 19:24which takes point cloud data.
  • 19:27What I mean by point cloud data is just a
  • 19:29bunch of points sitting in some dimension.
  • 19:32In this case,
  • 19:32these points are all in 3-dimensional
  • 19:35space, and it converts them into
  • 19:38something called a persistence diagram.
  • 19:40And this persistence diagram quantifies
  • 19:43how connected those points are.
  • 19:46And it also quantifies the shape of
  • 19:49this data in the sense that it measures
  • 19:53how loopy the trajectory is and whether
  • 19:56or not that trajectory has any holes in it.
  • 19:59So again,
  • 20:00this might sound very abstract at this stage,
  • 20:02but this is a technique that I'm going
  • 20:04to talk about in more detail today,
  • 20:06topological data analysis.
  • 20:07And what we do then is we can take
  • 20:10these topological features that are
  • 20:13capturing the shape of our trajectory,
  • 20:15and we can put them through some
  • 20:18machine learning in order to be able
  • 20:20to use GSTH as a diagnostic tool,
  • 20:23for example.
  • 20:24So what machine learning will do is it
  • 20:26will take these topological features
  • 20:29and classify whether or not
  • 20:32the individual that we are looking at
  • 20:35is a typically developed individual
  • 20:37or whether they have schizophrenia
  • 20:40or they have OCD or Alzheimer's.
  • 20:43What you can also do is you can
  • 20:45use this technique to figure out
  • 20:47whether the brain
  • 20:49is in the resting state or
  • 20:51is engaged in some task.
  • 20:52You can learn to figure out what task
  • 20:55an individual is doing by quantifying
  • 20:58the shape of these trajectories.
  • 21:00And there are many,
  • 21:01many other application areas that I'm
  • 21:04sure you can think of applying this to.
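As a toy sketch of this last step, assuming each subject has already been reduced to a small vector of topological features; the feature count, labels, and classifier choice below are illustrative only, not the pipeline's actual configuration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((40, 4))     # placeholder: 40 subjects x 4 topological features
y = rng.integers(0, 2, 40)  # placeholder labels, e.g. control vs. patient

# cross-validated accuracy of a simple linear classifier on those features
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores.mean())
```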
  • 21:07So I just want to go over the
  • 21:10workshop organization briefly.
  • 21:12So we have two other fantastic speakers
  • 21:16for our workshop, Rahul Singh,
  • 21:19he's in the audience today and Brian
  • 21:22Zabowski is also in the audience today.
  • 21:24Rahul is a Wu Tsai postdoctoral fellow.
  • 21:27He will be talking to you next week
  • 21:30and he'll be talking about graph
  • 21:33signal processing methods that form
  • 21:35the second step of our methodology.
  • 21:37And then the following week,
  • 21:39Brian and I
  • 21:41will jointly present to you the
  • 21:43entirety of the GSTH technique and we'll
  • 21:47share with you several applications
  • 21:49of GSTH both for cellular imaging data
  • 21:53sets and also for neuroimaging data sets.
  • 21:56And then of course, Helen will be around
  • 21:59to facilitate all of these workshops.
  • 22:01She's really the brains behind the operation.
  • 22:04And so we have the first three workshops to cover
  • 22:08different aspects of the GSTH methodology.
  • 22:11We are starting from the end.
  • 22:13So I'm going to talk about topological
  • 22:15data analysis and machine learning today.
  • 22:17Rahul will then talk about
  • 22:19graph signal processing.
  • 22:20In the third workshop,
  • 22:22we'll bring these things together
  • 22:24and go over the complete GSTH
  • 22:27methodology and its applications.
  • 22:29And then the final week of the workshop,
  • 22:31we are going to do a hands-on
  • 22:34tutorial where you'll get to load
  • 22:37a neuroimaging data set and
  • 22:40also a cellular imaging data set and
  • 22:43analyze them using GSTH in Python.
  • 22:45And at the moment I think
  • 22:47we're planning to hold
  • 22:48our fourth workshop as
  • 22:50a hybrid workshop
  • 22:52that might have an in-person component.
  • 22:55So we'll get back to you on that and
  • 22:58the location for that in subsequent weeks.
  • 23:01Yes, I will send a series of
  • 23:03emails where people can sign up
  • 23:06for the in-person component. Great.
  • 23:10All right. So we have a few live participants.
  • 23:13I understand that the majority of these
  • 23:16workshops get viewed online over a period
  • 23:18of like weeks and months and years.
  • 23:21So please feel free to stop me
  • 23:23anytime and to ask questions.
  • 23:25And so as I mentioned, I'm going to start
  • 23:29with topological data analysis.
  • 23:31And depending on how much time I
  • 23:33have available to me,
  • 23:35I will also cover some fundamentals
  • 23:37of machine learning just to make sure
  • 23:40that everybody is on the same page
  • 23:42and we all share the same vocabulary
  • 23:44in the weeks going forward.
  • 23:46So let's start with TDA.
  • 23:49And I wanted to start by just showing
  • 23:51you some of these point cloud examples.
  • 23:54And so when I look at these data sets,
  • 23:56what I see is that maybe in the first
  • 23:59data set, we have two variables.
  • 24:01Maybe we have an independent variable
  • 24:04and a dependent variable that are
  • 24:06strongly correlated together.
  • 24:08And this to me looks kind of like
  • 24:10a linear correlation,
  • 24:11like a regression type of data set.
  • 24:14When I look at the second data set here,
  • 24:17what I'm recognizing is that
  • 24:19the data set is clustered.
  • 24:21We have a bunch of points that are
  • 24:24grouped together and we have kind of
  • 24:27three clusters of data in this data set.
  • 24:29The third data set,
  • 24:31to me looks cyclical.
  • 24:33I can spot a circle in this data set,
  • 24:37and that might indicate that perhaps
  • 24:39this is some time lapse data set.
  • 24:41Maybe there's some kind of oscillatory
  • 24:43nature to this data set,
  • 24:45and maybe we're going around in
  • 24:47circles and the last data set
  • 24:49here has this kind of Y shape.
  • 24:52It looks like it's kind of branching out.
  • 24:55This could be maybe some stem cells
  • 24:57down here that are, you know,
  • 24:59differentiating into two different lineages.
  • 25:02It seems to have this tree like
  • 25:04hyperbolic structure to it.
  • 25:06And so our brains are really,
  • 25:08really good at recognizing the
  • 25:11shape of the data,
  • 25:13especially when the data is presented
  • 25:15to us in these low dimensions.
  • 25:17And we understand fundamentally
  • 25:19that any data that we have,
  • 25:22that data has some shape,
  • 25:24and the shape carries some meaning.
  • 25:27And this really is the central
  • 25:30tenet of topological data analysis,
  • 25:33which is a branch of applied
  • 25:36mathematics and computer science
  • 25:38that has to do with understanding
  • 25:41fundamentally the shape of our data.
  • 25:44And underlying all of this is what
  • 25:47we call the manifold hypothesis.
  • 25:50The idea being that any scientific
  • 25:53data that we collect in our lab
  • 25:57might look very noisy and it
  • 25:59might be very high dimensional.
  • 26:01But quite often that scientific data
  • 26:04is sampled from some low-dimensional
  • 26:07manifold. And what we are really after
  • 26:10is to understand what that manifold
  • 26:13looks like and what the intrinsic
  • 26:16dimension of that manifold is.
  • 26:19So in this example here our manifold
  • 26:21looks to be kind of saddle shaped and
  • 26:25it has these two curvature areas.
  • 26:27So it has a direction of positive curvature,
  • 26:30a direction of negative curvature,
  • 26:32and our data is simply
  • 26:34sampled from this manifold.
  • 26:36So what we really want to understand
  • 26:38is the shape of the manifold.
  • 26:40Another way to look at this is
  • 26:42what we get in our experiments
  • 26:45are individual data points,
  • 26:48and those data points all
  • 26:50together form some kind of shape.
  • 26:52And what we really want to see
  • 26:54is what that shape looks like.
  • 26:56So in this case,
  • 26:57all these data points form a
  • 26:59torus and this is kind of,
  • 27:01this is the realization that we
  • 27:03are going to come to is that
  • 27:05our data is arranged in the
  • 27:07shape of a doughnut or a Taurus.
  • 27:09So how do we actually go about doing that?
  • 27:12Let me share with you the methodology
  • 27:16using some very simple data sets that
  • 27:19are easy to plot in A2 dimensional slide.
  • 27:22And so we'll be working with these two
  • 27:25data sets for the next few slides.
  • 27:27The data set on the left,
  • 27:29I'm going to call the concentric
  • 27:31circles data set, and that's
  • 27:34simply in recognition of the fact
  • 27:36that these points are sampled
  • 27:38from 2 circles where one circle
  • 27:41is within another circle.
  • 27:42And the data set on the right,
  • 27:45I'm going to call the half moons data
  • 27:48set simply because both of these,
  • 27:51we have kind of two arcs in our
  • 27:53data and they both look like kind
  • 27:55of half moons or Crescent moons.
  • 27:58And So what we want to do is we
  • 28:00want to use a technique to recognize
  • 28:02the fact that our data on the left
  • 28:05is arranged in two circles.
  • 28:07And the data on the right,
  • 28:08it looks kind of circular,
  • 28:10but it's not really two circles
  • 28:14or one circle for that matter.
  • 28:16And so one thing you might want
  • 28:17to do is
  • 28:18consider using a
  • 28:20clustering method to see if that works,
  • 28:22right?
  • 28:22So you could take those data points
  • 28:25and throw them into an algorithm,
  • 28:27maybe something similar to k-means.
  • 28:29And you might see like,
  • 28:30OK, does the data cluster?
  • 28:32Well,
  • 28:32if you run this data set through k-means,
  • 28:35you'll end up with these clusters,
  • 28:37the blue cluster and the orange cluster.
  • 28:40And these two clusters don't really
  • 28:42tell you the true story behind the data.
  • 28:45In particular,
  • 28:46they don't recognize the fact that
  • 28:48these data are arranged in two circles.
  • 28:50And we even get some misclustering
  • 28:53happening in the data set on the right.
  • 28:56Now you might then go back and say that,
  • 28:57OK,
  • 28:58I should use a different sort
  • 29:00of clustering technique.
  • 29:02Maybe I can cluster the data by its density.
  • 29:05And so when you employ a density based
  • 29:08clustering methods such as DB scan,
  • 29:10you do indeed get the correct
  • 29:13cluster labels for your data.
  • 29:15You are able to separate data
  • 29:16points in the inner circle from
  • 29:18data points in the outer circle,
  • 29:20and you are able to separate the
  • 29:22data points belonging to the upper
  • 29:24crescent moon and the lower crescent moon.
  • 29:27Even then, the machine doesn't really know
  • 29:30that the data is arranged as circles.
  • 29:33It has no recognition of that.
  • 29:35It has simply learned that your data
  • 29:37is clustered into these two groups,
  • 29:39but it doesn't fundamentally understand.
  • 29:42What we can tell immediately is that this
  • 29:45data is arranged in a circular pattern.
  • 29:48And so this is where topology comes in.
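This comparison is easy to reproduce with scikit-learn's synthetic versions of these two data sets; a minimal sketch (the parameter values are illustrative):

```python
from sklearn.datasets import make_circles, make_moons
from sklearn.cluster import KMeans, DBSCAN

# synthetic versions of the two examples above
X_circles, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=0)
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# k-means splits the circles into two half-plane blobs (wrong), while
# density-based DBSCAN recovers the two rings (right); neither, however,
# tells us that the data is arranged as circles.
kmeans_labels = KMeans(n_clusters=2, n_init=10).fit_predict(X_circles)
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_circles)
```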
  • 29:52And so I'm going to talk
  • 29:54to you about topology.
  • 29:55And because I'm a very visual learner,
  • 29:58I'm going to use some animations and
  • 30:00some figures to kind of demonstrate how
  • 30:03topology works without necessarily going
  • 30:06into all the math and all the code behind it.
  • 30:09We'll get to use some of
  • 30:12the code in our third workshop.
  • 30:14But honestly, like the code is
  • 30:16something you import and you use.
  • 30:18And so I think it's much more important to
  • 30:21kind of build intuition around topology.
  • 30:24So what we do in in in topology is we
  • 30:27build something called simplicial complexes.
  • 30:31And there's a number of different kinds of
  • 30:34simplicial complexes that one can build.
  • 30:36But I'm going to talk about the viatorius
  • 30:39ribs simplicial complex to begin with today.
  • 30:42And so to create a Viatorius ribs
  • 30:45simplicial complex from your data,
  • 30:47what you do is you start with a given
  • 30:51data point and you imagine a disk of some
  • 30:55radius epsilon around that data point.
  • 30:58And you do this for every other
  • 31:00data point in the data set.
  • 31:02And you're going to grow this
  • 31:05epsilon radius disk over time.
  • 31:08And what you're going to do is when
  • 31:11two epsilon radius disks intersect
  • 31:13with each other, so they overlap,
  • 31:16you're going to draw an edge between
  • 31:19those two data points creating A1 simplex.
  • 31:23When you have three points shown
  • 31:26here as AB and C,
  • 31:28and they their epsilon discs all
  • 31:31intersect in a pair wise manner,
  • 31:34then we're going to draw a filled
  • 31:36in triangle which we are going to
  • 31:39call A2 simplex.
  • 31:40And then in higher dimensions,
  • 31:42when we have four data points all
  • 31:44intersecting in a pair wise manner,
  • 31:47then we're going to draw a three simplex.
  • 31:50So we are going to take our data set.
  • 31:52In this case the data set happens to
  • 31:55be two-dimensional and we are we are
  • 31:58constructing these simplices from our
  • 32:00data by expanding these epsilon radius
  • 32:03discs around each point in the data set.
  • 32:06And so in this visualization here,
  • 32:09I'm simply showing you the 0
  • 32:11simplex which which are the data
  • 32:12points that we started from,
  • 32:14and the one simplex which are all
  • 32:17the edges that get created as we are
  • 32:20expanding this epsilon radius disk.
  • 32:22I'm not showing you the disk and I'm
  • 32:25not showing you the field in triangles
  • 32:27or the tetrahedrons simply because the
  • 32:30the figure gets very, very crowded.
  • 32:34So why do we want to construct this
  • 32:38via Torres Ribs complex?
  • 32:40Well,
  • 32:41it turns out if you have some data shown
  • 32:44here as these red dots that are sampled,
  • 32:49this is your experimental data,
  • 32:51and you imagine that this
  • 32:53data is coming from some
  • 32:55kind of underlying manifold.
  • 32:57So there's a recognition here that
  • 32:59whatever data we sample comes from
  • 33:01a manifold that has maybe two holes
  • 33:03in the middle of it, it turns out,
  • 33:06and there's a theorem to prove this,
  • 33:08although we're not going to go through
  • 33:10the proof. The theorem says that if your
  • 33:13data is well sampled, so all these X,
  • 33:16the points in X are sampled
  • 33:18throughout the manifold quite well,
  • 33:20then when you construct the
  • 33:23Vietoris-Rips complex from this
  • 33:27data set for some radius epsilon,
  • 33:30then this Vietoris-Rips complex is basically
  • 33:33equivalent to the underlying manifold.
  • 33:36So in kind of more intuitive terms,
  • 33:40what this theorem is saying is
  • 33:42that if you want to learn the
  • 33:45shape of your manifold where the
  • 33:47data is being sampled from,
  • 33:49it is sufficient to construct a Vietoris-Rips
  • 33:53complex at some radius epsilon.
  • 33:56And you will be able to find the manifold
  • 33:59underneath the data and you'll be able
  • 34:01to recognize the fact that your data
  • 34:03is forming this one connected object
  • 34:06which has two holes punched into it.
  • 34:10OK, so let's get back to our example,
  • 34:13the concentric circles example
  • 34:15and the half moons example.
  • 34:17And so here I'm showing you those
  • 34:20epsilon radius discs around the data.
  • 34:23So we have epsilon equals
  • 34:260.05 at the beginning,
  • 34:28we increase our epsilon value.
  • 34:31And as we increase the epsilon value,
  • 34:33these discs that I'm plotting in grade,
  • 34:35they get bigger and bigger until
  • 34:38they cover the whole space.
  • 34:40And so what you can recognize here
  • 34:43is that when our disk is quite small,
  • 34:46even at epsilon equals 0.05,
  • 34:49all the little points that
  • 34:51are in the inner circle,
  • 34:53they all get connected together
  • 34:55because all of those disks are
  • 34:58overlapping with each other.
  • 35:00Then when we increase our epsilon to 0.15,
  • 35:03the inner circles are still
  • 35:06all connected together,
  • 35:07but now the outer circle as well:
  • 35:11the points in the outer circle
  • 35:13are also connected together.
  • 35:14So we observe 2 loops in our data.
  • 35:18As epsilon increases even more,
  • 35:21these loops get closed in and they
  • 35:24merge with each other until at the
  • 35:27end when epsilon is really big,
  • 35:29all of the disks intersect with
  • 35:32each other and everything collapses
  • 35:35into just one connected component.
  • 35:39In the two half moons data set,
  • 35:42what we see is that there is a value
  • 35:44of epsilon indeed where there is a
  • 35:46small circle that forms as these
  • 35:49points all get connected together
  • 35:50in a pairwise manner.
  • 35:52But that little circle quickly
  • 35:55disappears when epsilon increases
  • 35:57further and these two arcs get
  • 36:00connected together into one whole.
  • 36:02And so this technique
  • 36:06is called persistent homology.
  • 36:09And what it gives us is what we
  • 36:12call a topological barcode.
  • 36:15So there is code out there that will
  • 36:18take these data points as an input,
  • 36:21doesn't have to be two-dimensional
  • 36:22or three-dimensional,
  • 36:23could be high dimensional data
  • 36:25and it will perform this kind of
  • 36:28computation and give you back
  • 36:29a visual that looks like this,
  • 36:32which is the topological barcode.
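For reference, here is one hedged sketch of how that computation might look in Python, assuming the ripser.py and persim packages are installed (other TDA libraries such as GUDHI or giotto-tda would work equally well):

```python
from ripser import ripser
from persim import plot_diagrams
from sklearn.datasets import make_circles

# point cloud resembling the concentric circles example
X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=0)

result = ripser(X)     # Vietoris-Rips persistent homology, H0 and H1 by default
dgms = result['dgms']  # dgms[0] holds the H0 intervals, dgms[1] the H1 intervals
plot_diagrams(dgms, show=True)
```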
  • 36:34So let's kind of go through the
  • 36:36topological barcode and learn
  • 36:38how to interpret the barcode.
  • 36:40The barcode consists of two parts.
  • 36:42The top half I'm going to call H sub
  • 36:45zero for dimension 0 homology and the
  • 36:48lower part I'm going to call H
  • 36:51sub one for dimension 1 homology.
  • 36:53And so in dimension 0 homology,
  • 36:56we are measuring connectedness of
  • 36:58our data, and we generally call this
  • 37:02the number of connected components.
  • 37:03And so what you can see here is
  • 37:07that when epsilon is close to 0,
  • 37:09where my cursor is,
  • 37:11we see lots and lots of bars in our data set.
  • 37:14And these bars correspond to how
  • 37:17many connected components there
  • 37:19are in our data set.
  • 37:21So when epsilon is 0,
  • 37:22all the points are sitting by themselves,
  • 37:25none of the points are
  • 37:27connected to each other.
  • 37:28So we get as many bars as the
  • 37:31number of points in our data.
  • 37:33As epsilon starts increasing,
  • 37:35we start merging together points
  • 37:38by connecting them with an
  • 37:41edge and forming a 1-simplex.
  • 37:43So as epsilon is increasing here
  • 37:45you can see that the number of
  • 37:48bars is fewer and fewer until at
  • 37:51high values of epsilon we end up
  • 37:53with just one bar in our barcode.
  • 37:57So this dimension 0 homology,
  • 38:00this is capturing the connectivity of
  • 38:02our data and by looking at the slope
  • 38:06by which these bars are decreasing in number,
  • 38:09we can figure out how connected
  • 38:11our data set really is.
  • 38:14In dimension 1,
  • 38:15which is at the bottom of this barcode,
  • 38:19what we are measuring is the
  • 38:21presence of loops in our data set.
  • 38:23So at epsilon equal to 0 on the very left,
  • 38:26we have no bars in the lower
  • 38:28part of this diagram,
  • 38:30which means there are no loops
  • 38:32present at that value of epsilon.
  • 38:35At a later value of epsilon we
  • 38:38see the occurrence of this first
  • 38:40loop from these orange points
  • 38:42in the inner concentric circle.
  • 38:44That loop persists for a long period of time.
  • 38:49What I mean by time is it persists
  • 38:51for a large range of epsilon values.
  • 38:55During this process,
  • 38:56there is a second loop that forms,
  • 38:59indicated by the second red bar.
  • 39:01Here it emerges at a higher
  • 39:04value of epsilon,
  • 39:06and this outer loop dies sooner
  • 39:08than the inner loop does.
  • 39:10The inner loop persists for even longer.
  • 39:13So by looking at the bars in our bar code,
  • 39:17we can learn that our data has so
  • 39:20many points simply by counting the
  • 39:23number of bars at epsilon equal to 0.
  • 39:26We can learn how connected our data
  • 39:29set is by looking at how these bars
  • 39:32disappear as epsilon increases.
  • 39:34And then in the lower part of the bar code,
  • 39:37by looking at these bars,
  • 39:40we can learn how many loops
  • 39:42are present in our data.
  • 39:43In particular,
  • 39:44the bars that are longer in length
  • 39:47actually represent actual loops
  • 39:49that are present in our data.
  • 39:51There are indeed some smaller
  • 39:54bars which are small noisy
  • 39:56loops that form as we perform this procedure.
  • 40:00And what's apparent from these
  • 40:02two barcodes is that in our first
  • 40:05example with the concentric circles,
  • 40:07there are two clear loops in that data.
  • 40:11And in our second example, there is
  • 40:13indeed a small loop that emerges here,
  • 40:16but it quickly disappears,
  • 40:18so there are really no topologically
  • 40:21significant loops present
  • 40:22in the second data set.
  • 40:24And so these bar codes capture
  • 40:26the shape of our data.
  • 40:29You can continue to plot H2 and H3
  • 40:33which are going to capture higher
  • 40:35dimensional holes in your data.
  • 40:37So H2 is going to capture 3-dimensional
  • 40:41holes or voids in the data.
  • 40:43H3 will capture even higher-
  • 40:45dimensional holes in the data.
  • 40:46So topology captures the shape of our
  • 40:49data by measuring connectedness and
  • 40:50the presence of loops in the data.
  • 40:53Are there any questions?
  • 40:55Yes, Jay, just to kind of translate
  • 40:57math into more intuition.
  • 40:59When you say you have holes or loops,
  • 41:02you're pretty much talking
  • 41:04about some impossible states,
  • 41:06meaning that your state cannot have this,
  • 41:09like cannot be in the specific
  • 41:11state for whatever reasons, right?
  • 41:13Yeah, that's a great question.
  • 41:15So I'm talking indeed about
  • 41:17impossible states because these
  • 41:19points are derived from experiments
  • 41:21and they represent the state of our
  • 41:24brain or the state of our tissue.
  • 41:26And therefore if we have a hole in
  • 41:29our data set, that means there's no
  • 41:31data points present in the middle.
  • 41:33And that means that there is that
  • 41:35state is impossible as far as we can
  • 41:38tell from our experimental data.
  • 41:39So that's first conclusion.
  • 41:42The 2nd conclusion which we can get is this.
  • 41:46H1 measures kind of holes in two dimensions.
  • 41:50And so that necessarily means that there
  • 41:54is data that surrounds the hole, right?
  • 41:56There must be some surrounding data and
  • 41:58whenever there is data that surrounds a hole,
  • 42:01that might indicate some kind
  • 42:03of periodicity in the data set.
  • 42:06So you can imagine that if you have data
  • 42:08points that are arranged in a circle,
  • 42:10doesn't have to be a perfect circle, it
  • 42:12could be like an elliptical or skewed circle.
  • 42:15This technique still works.
  • 42:16But that tells you that
  • 42:19there is some sort of process.
  • 42:21Yeah, there's a process that goes
  • 42:23around in in a kind of periodic way.
  • 42:26So you can navigate those that state
  • 42:28space in a way that's periodic or
  • 42:31almost periodic or quasi periodic.
  • 42:34So impossible states,
  • 42:35as well as periodic states, are being
  • 42:38captured through dimension 1 homology
  • 42:40in this technique,
  • 42:42indeed. I
  • 42:44also have one question.
  • 42:45So when we are increasing epsilon,
  • 42:48yeah, are there some loops
  • 42:52that are disappearing?
  • 42:54Because if we increase epsilon,
  • 42:56loops should not
  • 42:58disappear, right? Loops
  • 43:00can disappear. So the way
  • 43:02this outer loop is disappearing
  • 43:04here is when there is a value of
  • 43:07epsilon when one of the disks from
  • 43:10the outer loop intersects with
  • 43:12the disk from the inner loop.
  • 43:14As soon as these two discs
  • 43:16start intersecting,
  • 43:17we draw an edge that goes from
  • 43:19a point in the inner loop to a
  • 43:21point on the outer loop and that
  • 43:24effectively connects those two
  • 43:26loops together, and the enclosing
  • 43:28space between the inner loop and the
  • 43:31outer loop disappears at that stage.
  • 43:33To get rid of the inner loop you have
  • 43:36to increase epsilon a lot higher
  • 43:38because you have two points that are
  • 43:41opposite to each other in the inner loop,
  • 43:44and when those two points
  • 43:46get connected to each other,
  • 43:47the inner loop closes in and disappears.
  • 43:50So yes, loops can disappear,
  • 43:53and in fact at a high enough value
  • 43:56of epsilon, all loops will close up,
  • 43:59and when epsilon is infinity,
  • 44:01all the points are necessarily
  • 44:03intersecting with each other and there
  • 44:05are no loops present in the data.
  • 44:07Maybe this is a little bit more apparent
  • 44:09in this previous animation when I was
  • 44:12drawing the one simplex where you
  • 44:14can see that at a higher value of epsilon,
  • 44:16we do indeed end up closing the outer
  • 44:19loop by connecting it to the inner loop.
  • 44:22So you see that there's a bridge that forms
  • 44:24from the outer loop to the inner loop,
  • 44:25and that empty space here closes in
  • 44:28and at a very high value of epsilon,
  • 44:30there will be edges that
  • 44:32go across the inner loop,
  • 44:34closing the inner loop entirely.
  • 44:36Although that doesn't happen in
  • 44:38this animation because I didn't
  • 44:40increase epsilon high enough.
  • 44:43I suppose the question...
  • 44:45There is a question in the chat.
  • 44:48And Zachary, would you like
  • 44:49to ask your question live or
  • 44:51would you like me to read that?
  • 44:53I think I can read it.
  • 44:54So the question says earlier you
  • 44:57mentioned sufficient sampling of points
  • 44:59as necessary to interpret a manifold.
  • 45:01What you're showing now seems to
  • 45:03address properties of the manifold.
  • 45:04Is there a property to address how much
  • 45:07variability is or isn't accounted for by
  • 45:10this manifold characterization process,
  • 45:12possibly due to under sampling?
  • 45:14That's a great question.
  • 45:16So what I'm presenting at the
  • 45:19moment is under the assumption
  • 45:22that our manifold is well sampled.
  • 45:24So indeed, if your experimental
  • 45:28procedure failed to sample a
  • 45:30data point in the middle here,
  • 45:32my conclusion would be that your
  • 45:34data set has cellular states
  • 45:36organized into two circles,
  • 45:39and they're kind of
  • 45:40independent of each other.
  • 45:41There is indeed an outer circle,
  • 45:43and that might be completely wrong
  • 45:45simply because we never sampled data
  • 45:48that exists between these two circles.
  • 45:50So I am indeed operating under the
  • 45:53assumption that the manifold is well sampled.
  • 45:57Another aspect of the question was
  • 46:00what kind of properties of
  • 46:03the manifold am I really capturing?
  • 46:06And so I'm capturing topological
  • 46:08properties of the manifold.
  • 46:10And by topological properties I
  • 46:12mean how connected that manifold
  • 46:14is and whether or not there are
  • 46:16holes in that manifold.
  • 46:18And so this technique is like invariant
  • 46:20to things like translation of the data.
  • 46:23So if I take these circles and
  • 46:25I translate them somewhere else,
  • 46:27that doesn't impact it.
  • 46:29If I take this diagram and I
  • 46:31rotate it by 45° or 30°,
  • 46:33that's not going to change
  • 46:35the bar code at all.
  • 46:36So it's translation invariant,
  • 46:38it's rotationally invariant.
  • 46:40And it's also invariant to certain
  • 46:43kinds of deformations where if I take
  • 46:46this arc and I deform it a little bit,
  • 46:49that's not going to change my barcode
  • 46:51and it's not going to change the
  • 46:54fact that the data is connected
  • 46:56in this crescent moon shape.
  • 46:58We'll get into more of the details
  • 47:00of what aspects of the manifold we
  • 47:03are capturing as we progress further.
  • 47:04But I hope that kind of goes a little
  • 47:07way towards answering your question.
  • 47:11OK, so we have introduced this
  • 47:14idea of a topological barcode.
  • 47:16Next I want to show you a more convenient
  • 47:20way of representing this barcode
  • 47:22which is called a persistence diagram.
  • 47:26And a persistence diagram
  • 47:28is very easy to construct.
  • 47:29What you do is you draw two axes:
  • 47:33the X axis is called the
  • 47:35birth axis and the Y axis is
  • 47:37called the death axis generally.
  • 47:39And these axes are going to represent
  • 47:43when bars start in the barcode.
  • 47:45When do the bars start in
  • 47:48the barcode and where do they end?
  • 47:50And so you cannot complete or you cannot
  • 47:53end a bar before starting it.
  • 47:57And because of that,
  • 47:58all the points in this persistence diagram
  • 48:01are going to happen above the diagonal.
  • 48:04And there's two kinds of points here.
  • 48:06I hope you can see that there's
  • 48:08points that are represented as tiny
  • 48:10circles and there are points that
  • 48:11are represented as tiny diamonds,
  • 48:13and so the circles are coming out
  • 48:16of the H0, dimension 0 homology,
  • 48:19representing the connectedness of the data.
  • 48:21They are all present here on the
  • 48:24left side of the persistence diagram
  • 48:26because all of these bars in H zero
  • 48:29start at epsilon zero and then they end
  • 48:32at some positive value of epsilon.
  • 48:34Therefore,
  • 48:35all of these points here represent
  • 48:38H 0 and the points that are over
  • 48:41here further away from zero,
  • 48:44these are all representing H1.
  • 48:47So I'm simply taking the starting
  • 48:49coordinate of the bar and the
  • 48:51ending coordinate of the bar,
  • 48:52and I'm just representing it
  • 48:54along these two axes.
  • 48:56And this is what's called
  • 48:58a persistence diagram.
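Since the construction is just a scatter plot of (birth, death) pairs, a minimal matplotlib sketch (with made-up intervals for illustration) might look like this:

```python
import matplotlib.pyplot as plt

# placeholder (birth, death) intervals read off a barcode
h0 = [(0.0, 0.08), (0.0, 0.12), (0.0, 1.5)]  # H0: connected components
h1 = [(0.15, 1.1), (0.30, 0.45)]             # H1: loops

fig, ax = plt.subplots()
ax.scatter(*zip(*h0), marker='o', label='H0')
ax.scatter(*zip(*h1), marker='D', label='H1')
ax.plot([0, 1.6], [0, 1.6], 'k--')           # diagonal: birth = death
ax.set_xlabel('birth')
ax.set_ylabel('death')
ax.legend()
plt.show()
```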
  • 49:00The conventional wisdom in persistent
  • 49:03homology and topological data
  • 49:05analysis is that points that are
  • 49:09further away from the diagonal,
  • 49:12they correspond to longer bars
  • 49:14in the bar code,
  • 49:16and those are the topologically more
  • 49:19significant features in our data.
  • 49:21So in this case,
  • 49:23for the concentric circles example,
  • 49:24we see two red diamonds that
  • 49:26are far away from the diagonal,
  • 49:29indicating the presence of two loops
  • 49:32or two circles in our data set.
  • 49:36This is the first,
  • 49:38the topological barcode for the
  • 49:39Half Moons data set,
  • 49:41and this is the corresponding
  • 49:43persistence diagram.
  • 49:44You'll notice that this highlighted
  • 49:46red diamond corresponding to that
  • 49:49tiny loop that emerged for a bit
  • 49:51is actually quite close to the
  • 49:53diagonal in this persistence diagram,
  • 49:55indicating that it's not very significant.
  • 49:58Now,
  • 49:59there are ways to compute statistics
  • 50:01here and to figure out more
  • 50:04quantitatively whether or not a given
  • 50:07topological feature is significant,
  • 50:09but I'm not going to get into that today.
  • 50:11There are bootstrapping methods
  • 50:13that give you a confidence
  • 50:15interval around the diagonal,
  • 50:18and anything that falls inside of
  • 50:20that confidence interval is going
  • 50:23to be insignificant features.
  • 50:25And anything that falls outside
  • 50:27of that confidence interval is
  • 50:29further away from the diagonal,
  • 50:30and it's going to be topologically
  • 50:33significant features.
  • 50:34So those kinds of tools exist,
  • 50:36but I'm not getting into that today,
  • 50:38just to build intuition.
  • 50:42OK, so here's a small quiz that I like
  • 50:44to do when I maybe present this also to
  • 50:48undergraduates where I have three data
  • 50:51sets and three persistence diagrams,
  • 50:52but I've kind of jumbled them all up.
  • 50:55And so let me just quickly tell you what the
  • 50:57data sets are and what the diagrams are.
  • 50:59I'll give you a moment to think,
  • 51:01I'll drink some water,
  • 51:02and then we'll go over the solution.
  • 51:04So the first data set here,
  • 51:07it's very hard to tell,
  • 51:08but these are two spheres, where I
  • 51:11have data sampled from an inner sphere and
  • 51:15data sampled from the outer sphere here.
  • 51:19So it's in three dimensions.
  • 51:21The second example are three circles where
  • 51:23I have two circles that are concentric
  • 51:26with one another and a separate circle
  • 51:29that's outside of these two circles.
  • 51:32And the third one is really hard
  • 51:33to tell in this visualization,
  • 51:35but this is data that's samples
  • 51:37from the surface of a doughnut
  • 51:40or a Taurus in mathematics.
  • 51:42So this is again A3 dimensional object.
  • 51:45It in the lower half,
  • 51:46I'm showing you persistence diagrams where
  • 51:49the black dots are representing dimension
  • 51:520 homology and that's connected components.
  • 51:56The red triangles are representing dimension
  • 52:001 homology which are loops in our data.
  • 52:04And now we have blue diamonds,
  • 52:07and the blue diamonds are
  • 52:09representing H2 dimension 2 homology,
  • 52:12which are three-dimensional
  • 52:14holes or voids in our data.
  • 52:17And so I'd like you to think about
  • 52:20matching these data sets to their
  • 52:24corresponding persistence diagrams.
  • 52:25A hint would be to look at
  • 52:28the blue diamonds first,
  • 52:29because blue diamonds are
  • 52:31indicating 3D empty space.
  • 52:33I'm going to take a quick drink of water
  • 52:35and then we'll go over the solution.
  • 52:49OK, So hopefully folks have realized
  • 52:54that this persistence diagram on the left
  • 52:59doesn't have any blue diamonds in it,
  • 53:02doesn't have any 3D empty space in it,
  • 53:05and therefore it corresponds to the second
  • 53:08data set of the three concentric circles.
  • 53:12You can see that there is 2 red triangles
  • 53:14that are further away from the diagonal here,
  • 53:17and there's one red triangle that's a
  • 53:19little bit away from the diagonal here.
  • 53:22And those three, those 3 triangles
  • 53:25correspond to these three loops.
  • 53:28One of the triangles is quite close
  • 53:30to the diagonal because of the
  • 53:31fact that these are concentric,
  • 53:33so you can kind of bridge across
  • 53:35them quite easily.
  • 53:36Now we have 2 persistence diagrams
  • 53:39that have diamonds in them,
  • 53:41and to figure out which one is which,
  • 53:44I think you have to look at some
  • 53:47of these triangles again.
  • 53:49And so what distinguishes the right one
  • 53:51from the left one is, on the right one,
  • 53:53I'm not really seeing any triangles
  • 53:55that are very far from the diagonal.
  • 53:58But in this one,
  • 53:59I see one triangle here that's
  • 54:01far from the diagonal.
  • 54:03And maybe there's another triangle
  • 54:04here that's kind of separated from all
  • 54:07the noise over here that's slightly
  • 54:09further away from the diagonal.
  • 54:11And so the way you can get these two
  • 54:14triangles is because when you have a torus,
  • 54:17if you think about a torus as a doughnut,
  • 54:20you have a circle that goes
  • 54:24across the torus,
  • 54:25like a horizontal circle
  • 54:27going across the doughnut.
  • 54:29And then you have another loop,
  • 54:31another circle that goes kind of
  • 54:34perpendicular to the first circle and
  • 54:36goes around the doughnut in this way.
  • 54:39So then in a doughnut,
  • 54:40there are two loops and there's
  • 54:43one empty space or one 3D hole.
  • 54:46Whereas in the concentric spheres example,
  • 54:49you just have the empty space between the
  • 54:53two spheres that's shown by this diamond.
  • 54:57So sorry,
  • 54:58there are two empty spaces.
  • 54:59There's empty space inside the inner sphere,
  • 55:02and there's empty space between the
  • 55:05outer sphere and the inner sphere.
  • 55:07And that shows up here because you
  • 55:10have one blue diamond up here,
  • 55:12and then maybe you have one blue
  • 55:14diamond here that's slightly
  • 55:15further away from the diagonal.
  • 55:17And so those correspond to the space
  • 55:19inside the inner sphere and the
  • 55:22interstitial space between the two spheres.
  • 55:25So hopefully that helps you build
  • 55:27some intuition.
  • 55:27This is a lot easier to do when
  • 55:29I put confidence intervals,
  • 55:31but if I do put confidence intervals,
  • 55:33we'll have to compute them separately for
  • 55:36dimension 0, dimension 1, and dimension 2.
  • 55:39And so that makes things harder.
  • 55:40Also,
  • 55:41I don't really have enough space
  • 55:42to to draw 3 persistence diagrams
  • 55:45with three confidence intervals,
  • 55:47but hopefully that makes sense.
  • 55:48If you have any question about how to
  • 55:51interpret these persistence diagrams
  • 55:52or like the solution to this little quiz,
  • 55:55please feel free to chime in.
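To build the same intuition hands-on, here is a minimal sketch, assuming the ripser.py library; the torus radii, sample size, and seed are arbitrary illustration choices, not from the talk:

```python
import numpy as np
from ripser import ripser

# Sample 400 points from a torus: u runs around the tube, v around the hole.
rng = np.random.default_rng(0)
u, v = rng.uniform(0, 2 * np.pi, (2, 400))
R, r = 2.0, 1.0
X = np.c_[(R + r * np.cos(u)) * np.cos(v),
          (R + r * np.cos(u)) * np.sin(v),
          r * np.sin(u)]

# Persistence diagrams up to dimension 2: for a torus you should see
# two prominent H1 points (the two loops) and one H2 point (the void).
dgms = ripser(X, maxdim=2)['dgms']
```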
  • 55:59OK, so now at this point,
• 56:03we have figured out how to take
  • 56:05our point cloud data set and convert
  • 56:08it into a topological barcode,
  • 56:11which we can represent
  • 56:13as a persistence diagram.
  • 56:14So the next thing we want to do is we
  • 56:17want to compare two different data sets.
  • 56:20And so comparing two different
  • 56:22point clouds can be quite tricky.
  • 56:24You don't really know like how to
  • 56:27distinguish a torus from a sphere
  • 56:30from some other blobby thing.
• 56:32And so one way in which you can
• 56:35compare these kinds of data, two
• 56:37different point clouds, is by instead
• 56:40comparing their persistence diagrams.
  • 56:43And so there are multiple
  • 56:46techniques to compute distances
  • 56:48between persistence diagrams.
  • 56:51And one of those techniques is what's
  • 56:53called the bottleneck distance.
• 56:55And so what happens in the bottleneck
• 56:58distance is that you pair points up.
  • 57:01And again,
  • 57:02I should apologize here because I'm
  • 57:04using color slightly differently here.
• 57:06So the blue-colored dots are
• 57:09from the first persistence diagram,
• 57:11diagram 1, and the red
• 57:13squares are from diagram 2.
  • 57:16And so we want to compare
  • 57:17diagram 1 to diagram 2.
  • 57:19And the way we do that is by first
  • 57:22matching features in diagram
  • 57:231 to features in diagram 2,
• 57:26where we also allow ourselves to
  • 57:29map certain features to the diagonal itself.
  • 57:32So that's a matching process that happens.
  • 57:35And once you have matched the features
  • 57:38to each other or to the diagonal,
  • 57:40you find the two paired features that
  • 57:43are furthest away from each other and
  • 57:46you compute this distance between them.
  • 57:48This is called the bottleneck distance,
• 57:51and there are ways to represent
• 57:53that mathematically,
• 57:54but here that's not so important.
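As a concrete sketch, assuming the persim library (the two toy diagrams are made up for illustration):

```python
import numpy as np
import persim

# Two toy H1 persistence diagrams; each row is (birth, death).
dgm1 = np.array([[0.2, 1.4], [0.5, 0.7]])
dgm2 = np.array([[0.3, 1.2], [0.6, 0.65]])

# Bottleneck distance: match features across diagrams (or to the
# diagonal) and report the largest distance among matched pairs.
d_b = persim.bottleneck(dgm1, dgm2)
print(f"bottleneck distance: {d_b:.3f}")
```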
• 57:56The intuition is probably
• 57:58what's most important.
• 57:59And there is a very important theorem
• 58:02in the field that guarantees stability;
• 58:05it's called the stability theorem.
  • 58:08And what it says is that if I have a
  • 58:11point cloud X and a point cloud Y,
  • 58:14if my point cloud X is just a slight
• 58:16perturbation of point cloud Y,
• 58:18so I've just moved the points
• 58:20around a little bit.
• 58:21Then the bottleneck distance between
• 58:24the persistence diagrams computed
• 58:26through the Vietoris-Rips complex of
• 58:28X and the Vietoris-Rips complex of Y.
  • 58:31This bottleneck distance is
  • 58:33guaranteed to be small because if
  • 58:36X is slightly different from Y,
  • 58:39the right hand side of this
  • 58:41equation is going to be close to 0,
  • 58:42and therefore the bottleneck distance
  • 58:44is going to be very very small.
  • 58:47So this basically guarantees the fact
  • 58:49that if you have one point cloud,
  • 58:51maybe points arranged as a circle
  • 58:53and you tweak the point slightly,
  • 58:55so you add a little bit of noise
  • 58:56to those points,
  • 58:57then the bottleneck distance is
  • 58:59not going to change much.
  • 59:01So it means that this bottleneck distance
  • 59:04is a stable way of comparing point clouds.
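For reference, one common statement of the stability theorem for point clouds, written from memory so treat the constant as indicative, is

$$ d_B\big(\mathrm{Dgm}\,\mathrm{Rips}(X),\ \mathrm{Dgm}\,\mathrm{Rips}(Y)\big) \;\le\; 2\, d_{GH}(X, Y), $$

where $d_{GH}$ is the Gromov-Hausdorff distance between the point clouds X and Y. If X is a slight perturbation of Y, the right-hand side is close to 0, so the bottleneck distance between the two diagrams is forced to be small.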
• 59:07Again, I don't care so much about the math.
• 59:09The main result is that topology
• 59:13is robust to these kinds of
• 59:18noise and perturbations in
• 59:20our data. Now one of the problems
• 59:22with the bottleneck distance is that
  • 59:25we are doing all this matching,
  • 59:26but ultimately we are only really
  • 59:28looking at the distance between points
  • 59:30that are matched but are furthest away
  • 59:33from each other and we're ignoring
  • 59:35all the other points that got matched.
• 59:37So maybe you want to be more sensitive
• 59:40to how well the matching works,
• 59:43and the way to actually use that
• 59:46information is to compute what's
• 59:48called the Wasserstein distance,
• 59:50where you perform the matching process first.
• 59:54And then you compute this Wasserstein
• 59:57distance between diagram 1 and diagram 2,
• 59:59where you sum over the distances between
• 01:00:02all the pairs of points that are matched to
• 01:00:05each other, with points matched to the
• 01:00:07diagonal contributing their distance to it.
• 01:00:08So this is an even more informative
• 01:00:12way of comparing two persistence diagrams.
• 01:00:16And the Wasserstein distance also has
• 01:00:18stability properties similar to the ones
• 01:00:21I was talking about previously.
• 01:00:24If you have any kind of experience
• 01:00:27with like optimal transport theory
• 01:00:29or like statistics,
• 01:00:30then I just wanted to highlight that
• 01:00:33the Wasserstein distance that we're
• 01:00:35talking about here is similar to,
• 01:00:37it's actually exactly the same as
• 01:00:39the Wasserstein distance that you
• 01:00:40would be familiar with, in the sense
• 01:00:42that you have a transport map and
• 01:00:44you're kind of moving mounds of
• 01:00:46earth from one place to another,
• 01:00:48or the Wasserstein distance that you
• 01:00:50use to compare two probability distributions.
• 01:00:52You can think about these topological
• 01:00:54features as like probability distributions
• 01:00:56and you're learning to map one to the other.
  • 01:00:58So this is just an aside for folks who might
  • 01:01:02be more familiar with optimal transport.
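A minimal sketch of the Wasserstein computation, again assuming the persim library, with made-up toy diagrams:

```python
import numpy as np
import persim

dgm1 = np.array([[0.2, 1.4], [0.5, 0.7]])
dgm2 = np.array([[0.3, 1.2], [0.6, 0.65]])

# Wasserstein distance: sums the costs over the whole matching,
# so it is sensitive to every matched pair, not just the worst one.
d_w = persim.wasserstein(dgm1, dgm2)
print(f"Wasserstein distance: {d_w:.3f}")
```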
  • 01:01:05OK,
• 01:01:07I wanted to put this review
• 01:01:10article in because I have talked
• 01:01:13so far about extracting
• 01:01:16topological features from point cloud
• 01:01:18data sets and then learning how to
  • 01:01:21interpret those topological features
  • 01:01:23and to compare two different data
  • 01:01:26sets by computing the bottleneck
  • 01:01:28distance or the washer steam distance
  • 01:01:31between their topological features.
• 01:01:33But it is also possible
• 01:01:37to use topology in a different way,
  • 01:01:39where you use topology to inform the
  • 01:01:43training of a machine learning architecture.
  • 01:01:47And so for folks who are familiar
  • 01:01:49with machine learning,
  • 01:01:50I just wanted to point out that
  • 01:01:53there are ways in which you can use
  • 01:01:55topology inside the loss function
  • 01:01:57of your neural network.
• 01:01:59So you can have a topology-informed loss,
• 01:02:02and a good example of this is a
• 01:02:05paper by Channel in 2019 that I can point to.
  • 01:02:08You can also in machine learning
  • 01:02:11use topology to compare two
  • 01:02:13different model architectures.
  • 01:02:15One way of doing this would be to
  • 01:02:17have two different like machine
  • 01:02:19learning architectures where you
  • 01:02:20look at the activations in all
  • 01:02:23the layers of those architectures,
  • 01:02:25treat that as a point cloud and compare
• 01:02:27them against each other using the Wasserstein
• 01:02:30distance of their persistence features.
• 01:02:32And a good example of that is
• 01:02:34Zo et al. in 2021.
• 01:02:36And then what I'm talking about,
• 01:02:39which is similar to this paper from 2017,
  • 01:02:41is to actually just take your data and
  • 01:02:44use topology to featurize that data,
  • 01:02:47which is to extract topological
  • 01:02:49features of that data,
  • 01:02:50learn to interpret those
  • 01:02:52topological features,
  • 01:02:53and then perhaps pass them into
  • 01:02:55some machine learning framework
  • 01:02:57to generate some kind of output.
• 01:02:59So there are different places in machine
• 01:03:01learning where one can use topology.
  • 01:03:04In our case,
  • 01:03:05we are going to focus on ways in
  • 01:03:07which we use topological features
  • 01:03:09extracted from our data and pass
  • 01:03:12them into machine learning in order
  • 01:03:14to do some kind of downstream task.
  • 01:03:17OK, so next I wanted to cover a few ways
  • 01:03:21of taking these topological features
  • 01:03:25and converting them into summaries.
  • 01:03:28And the reason for doing that
  • 01:03:30is because we want to use these
  • 01:03:33topological features as input for
  • 01:03:35machine learning down the line.
  • 01:03:37And these diagrams that I've
  • 01:03:39been drawing for you so far,
  • 01:03:41they're easy to draw on a screen,
  • 01:03:43but they're not really great for
  • 01:03:45machine learning because if you
  • 01:03:47have a bunch of different data sets,
• 01:03:49you're going to get a different number of
  • 01:03:52topological features for each data set.
  • 01:03:54And you don't really know a way of
  • 01:03:56like converting this into something
  • 01:03:58that can go into machine learning.
  • 01:03:59So folks have found various ways of
  • 01:04:03taking these persistence diagrams
  • 01:04:04and converting them into even more
• 01:04:08convenient representations that can be
• 01:04:10used for mathematical analysis,
• 01:04:13but more importantly for machine
• 01:04:14learning down the road.
  • 01:04:16And one such representation is
  • 01:04:18called the persistence landscape,
  • 01:04:20where you are taking a diagram like this
  • 01:04:23and converting that into a function.
• 01:04:25And the way you do that
• 01:04:26is quite simple really.
• 01:04:28You take each point and you draw a
• 01:04:31tent function based off of that point
• 01:04:33by connecting it to its X coordinate
• 01:04:36and connecting it to its Y coordinate
• 01:04:39intersected with the diagonal.
  • 01:04:41It takes more words to describe,
• 01:04:43so you can just simply
• 01:04:45see it from this picture.
  • 01:04:47You draw this little tent function
  • 01:04:50and then tilt the diagram by 45° and
  • 01:04:53there you end up getting this function
  • 01:04:57representation of your persistence diagram.
  • 01:05:00You can now treat this as a function,
  • 01:05:02and you can use tools from functional
  • 01:05:06analysis to analyze this persistence diagram.
  • 01:05:09Again,
• 01:05:09you can kind of formalize this with a
• 01:05:12bunch of math by writing out what a tent
• 01:05:15function looks like and how you take
  • 01:05:17a diagram and convert it into a function,
  • 01:05:21but this is all just notation.
  • 01:05:25I simply want to convey to you
  • 01:05:27the intuition behind taking a
  • 01:05:29diagram and converting it into a
  • 01:05:32function for downstream analysis.
  • 01:05:34There's some important reasons why one
  • 01:05:36might want to convert this into a function.
  • 01:05:39One of them is that you can use
  • 01:05:41tools from functional analysis.
  • 01:05:43Another thing that's important is
  • 01:05:45that this is an injective mapping,
  • 01:05:48and it satisfies the same properties
  • 01:05:50that persistence diagrams satisfy.
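A minimal sketch of the landscape construction, assuming the gudhi library (the diagram values are made up):

```python
import numpy as np
from gudhi.representations import Landscape

# One toy diagram of (birth, death) pairs with finite deaths.
dgm = np.array([[0.2, 1.0], [0.4, 0.8]])

# Sample the first 3 landscape (tent) functions on a 100-point grid,
# turning the diagram into a fixed-length vector.
landscape = Landscape(num_landscapes=3, resolution=100)
vec = landscape.fit_transform([dgm])   # shape (1, 3 * 100)
```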
  • 01:05:54Another convenient way of converting
  • 01:05:57persistence diagrams into something
  • 01:05:59that's useful for machine learning is
  • 01:06:02to convert the diagram into an image.
  • 01:06:05The reason you might want to do this is
  • 01:06:07because we have architectures that are very,
  • 01:06:10very good at dealing with images.
  • 01:06:12We know how to classify
  • 01:06:14cats and dogs and horses.
  • 01:06:16We also know how to generate images.
  • 01:06:18So we can take advantage of all the tools
  • 01:06:20we have developed for dealing with images
  • 01:06:23in machine learning if we can convert
  • 01:06:25persistence diagrams into an image.
  • 01:06:27And the way one goes about doing that is
  • 01:06:30you take your input persistence diagram,
  • 01:06:33you tilt it again by 45°.
  • 01:06:36So now you're measuring the birth
  • 01:06:38coordinate and you're measuring
  • 01:06:39distance from the diagonal, which we
  • 01:06:41call persistence on the Y coordinate.
  • 01:06:44So nothing fancy,
  • 01:06:45just kind of tilting the diagram.
  • 01:06:47Then what you do is at each point in
  • 01:06:51the diagram, you drop a Gaussian,
  • 01:06:54so like a 2D Gaussian,
  • 01:06:56and you weigh the Gaussians by
  • 01:06:58distance away from the X axis.
  • 01:07:01So points that are higher
  • 01:07:03up get a brighter Gaussian.
  • 01:07:04The points that are lower down,
  • 01:07:06they get a lower amplitude Gaussian.
  • 01:07:09Again,
  • 01:07:09the rationale for doing that is that
  • 01:07:11points that are further away are points
  • 01:07:14that are more topologically significant.
• 01:07:16Points that are close to the diagonal are
• 01:07:18kind of derived from some noise in our data,
  • 01:07:20and we want to be robust to noise.
  • 01:07:22So it makes sense to weigh things
  • 01:07:24by distance away from the diagonal.
  • 01:07:26This is called a persistence image.
  • 01:07:28This is still a continuous object.
  • 01:07:30And So what you can do then is you
  • 01:07:33can take this surface and you can
  • 01:07:35just divide it into smaller pixels
  • 01:07:38and convert this into an image format.
  • 01:07:41And once you have this in an image format,
• 01:07:43you can use convolutional neural
• 01:07:45networks and other kinds of
• 01:07:48generative AI tools, taking advantage
• 01:07:50of all of those tools to
• 01:07:53work with these persistence images.
  • 01:07:56Again,
  • 01:07:56there's a bunch of math that one
  • 01:07:58can write down to kind of formally
  • 01:08:00describe this process,
  • 01:08:01but I think the visuals do a much better job.
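A minimal sketch of the persistence image transform, assuming the persim library (the pixel size and diagram values are arbitrary):

```python
import numpy as np
from persim import PersistenceImager

dgm = np.array([[0.2, 1.0], [0.4, 0.8], [0.1, 0.15]])

# Tilt to (birth, persistence), drop a persistence-weighted Gaussian
# on each point, then discretize onto a pixel grid.
pimgr = PersistenceImager(pixel_size=0.05)
pimgr.fit([dgm])                  # choose image bounds from the data
img = pimgr.transform([dgm])[0]   # a 2D numpy array, ready for a CNN
```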
  • 01:08:06Finally, you can convert your
  • 01:08:08persistence diagrams into what are
  • 01:08:11called smooth persistence curves.
  • 01:08:13And the way this works is
  • 01:08:16you walk along the diagonal,
  • 01:08:18and as you're walking along the diagonal
  • 01:08:21from the bottom left to the top right,
  • 01:08:23you look at a window.
  • 01:08:25And you construct the
  • 01:08:26window by looking at this,
  • 01:08:28like this rectangular section
  • 01:08:30that's to the top left of
  • 01:08:33wherever you are on the diagonal.
  • 01:08:36And what you do is you compute some
  • 01:08:38kind of statistic of points that
  • 01:08:41exist within this little window.
  • 01:08:42So one simple statistic would be
  • 01:08:45simply counting the number of points
  • 01:08:47that exist within this window.
  • 01:08:49So that allows you to construct a
  • 01:08:52function as you're walking from
  • 01:08:55left to right, construct a curve,
  • 01:08:58a continuous curve,
  • 01:08:59which you can then analyze.
  • 01:09:03I don't have the curve here for some reason,
  • 01:09:06but you can imagine like as
  • 01:09:07you're walking along the diagonal,
  • 01:09:08just counting how many objects
  • 01:09:10exist within this window over time
  • 01:09:12gives you a continuous curve,
• 01:09:14which you can describe mathematically;
• 01:09:15you can prove things about that curve.
  • 01:09:17And one of the reasons why you might
  • 01:09:19want to use this curve is that there
  • 01:09:21are ways to speed this process up a lot
  • 01:09:24because these are all like Gaussians.
  • 01:09:26And so there's ways to compute
  • 01:09:28this curve very, very fast.
  • 01:09:30So that's also helpful for
  • 01:09:33machine learning purposes.
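The point-counting version described above is usually called a Betti curve; here is a minimal sketch in plain numpy (my own illustration, with made-up values):

```python
import numpy as np

def betti_curve(dgm, thresholds):
    # A feature sits in the window at position t on the diagonal
    # exactly when it is alive there: birth <= t < death.
    return np.array([np.sum((dgm[:, 0] <= t) & (dgm[:, 1] > t))
                     for t in thresholds])

dgm = np.array([[0.2, 1.0], [0.4, 0.8]])
ts = np.linspace(0.0, 1.2, 50)
curve = betti_curve(dgm, ts)   # a fixed-length vector for machine learning
```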
  • 01:09:35OK, Finally I just wanted to give you
  • 01:09:38some sense of how you do topology for
  • 01:09:42data that is not point cloud based data.
  • 01:09:45What if you have like images
  • 01:09:47that you want to work with?
  • 01:09:50You can compute topology
  • 01:09:52directly from images.
  • 01:09:54So think of an image as nothing
  • 01:09:56but a matrix of values, right?
  • 01:09:58And the values are going to depict
  • 01:10:00how bright a given pixel is.
• 01:10:02So if I have a 1 here,
• 01:10:03that pixel is quite dark,
• 01:10:05and a 5 is brighter.
• 01:10:07Sorry,
• 01:10:08the color here
• 01:10:10doesn't really correspond to the value,
• 01:10:12but 5 would be a brighter pixel,
• 01:10:14and 3 would be slightly dimmer than 5,
• 01:10:16but brighter than 1.
  • 01:10:17So you can think of an image as nothing
  • 01:10:20but a matrix of image intensity values.
• 01:10:22And you can perform the same kind of
• 01:10:26filtration that we did previously by
• 01:10:28expanding that epsilon radius disk,
• 01:10:31but now by going through this matrix of values
• 01:10:34and simply deleting everything that's
• 01:10:37above a value or below a value.
• 01:10:39These are called sublevel set
• 01:10:41or superlevel set filtrations.
• 01:10:43In this example,
• 01:10:45only values that are
• 01:10:48less than or equal to 1 are shown,
• 01:10:50and we spot two holes in our data.
  • 01:10:52So five and three become holes.
  • 01:10:55Then you increase your threshold to three.
  • 01:10:57So now three gets filled in,
  • 01:10:59but there's only one hole in the data set.
• 01:11:01And then as you increase your
• 01:11:03threshold to 5, both
• 01:11:05of those holes get filled in.
  • 01:11:07So when you're working with images,
  • 01:11:09you can construct what are
• 01:11:12called cubical complexes,
  • 01:11:13where you define a threshold
  • 01:11:15value for your image,
  • 01:11:16and by applying that threshold you
  • 01:11:19can count in your pixels how many
  • 01:11:22holes exist and you can quantify the
  • 01:11:25shape of an image in that manner.
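A minimal sketch of a sublevel set filtration on an image, assuming the gudhi library; the toy pixel values mirror the 1/3/5 example above:

```python
import numpy as np
import gudhi

# Toy "image": two bright pixels (5 and 3) surrounded by dark pixels (1).
img = np.array([[1, 1, 1, 1, 1],
                [1, 5, 1, 3, 1],
                [1, 1, 1, 1, 1]], dtype=float)

# Cubical complex with a sublevel set filtration on pixel intensities.
cc = gudhi.CubicalComplex(top_dimensional_cells=img)
cc.persistence()
# Expect two H1 features born around 1, dying at 3 and 5,
# as the two bright pixels get filled in.
holes = cc.persistence_intervals_in_dimension(1)
```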
  • 01:11:27I'll show you how this can be used in a
  • 01:11:30very powerful way in a subsequent workshop.
  • 01:11:33I just want to point to the paper here.
• 01:11:37So there's a NeurIPS paper that came out
• 01:11:39in 2020 by Bastian Rieck and
• 01:11:42other folks in the topology community
• 01:11:45where they took fMRI images,
• 01:11:47which were volumetric fMRI images.
  • 01:11:50So there's lots and lots of data.
• 01:11:52And they performed this cubical
• 01:11:54complex filtration of these
  • 01:11:56volumetric images to construct a
  • 01:11:59sequence of persistence diagrams,
  • 01:12:01which they converted into persistence images.
  • 01:12:04And from these persistence images,
• 01:12:06they were able to use machine
• 01:12:08learning techniques to
  • 01:12:09categorize different
  • 01:12:10brain state trajectories.
  • 01:12:12So this is a very cool paper that
  • 01:12:14combines a lot of the stuff that we
  • 01:12:17talked about of going from images to
  • 01:12:20persistence diagrams to persistence
  • 01:12:21images and those images being used
  • 01:12:24as input for machine learning to
  • 01:12:26classify brain state trajectories.
  • 01:12:29You can also take directly
  • 01:12:31the persistence diagram,
  • 01:12:32compute summaries of the persistence
  • 01:12:34diagram such as persistence
  • 01:12:36landscapes and persistence curves.
  • 01:12:38This is a persistence curve here,
  • 01:12:40and you can use those persistence
  • 01:12:42curves directly to perform regression
  • 01:12:44tasks such as estimating the severity
• 01:12:46of the disease in these fMRI images.
  • 01:12:51Lastly, something that's going to be
  • 01:12:54highly relevant to us going forward
  • 01:12:56is doing TDA on time series data.
  • 01:13:00So here's an example of
  • 01:13:02a time series data set.
  • 01:13:04These are just two sinusoidal
  • 01:13:06curves, F1 and F2.
  • 01:13:08You can see F1 has a higher amplitude,
  • 01:13:11F2 has a smaller amplitude over time.
  • 01:13:15And So what you can do is you
  • 01:13:16can plot them against each other.
  • 01:13:18So you can plot F1 against F2.
• 01:13:21And this is one way of
• 01:13:23taking time series data,
• 01:13:24discretizing in time of course,
• 01:13:26and converting it into
• 01:13:28a point cloud data set.
  • 01:13:30And you can compute topology directly
  • 01:13:32from this point cloud data set.
  • 01:13:34So this works when you have two
  • 01:13:36time series data sets, F1 and F2.
  • 01:13:38You can convert that into a point cloud.
  • 01:13:41If you have just one time series data set,
  • 01:13:44what you can do is you can do a
  • 01:13:46sliding window transformation.
  • 01:13:48So you take a small sliding window,
  • 01:13:50so a small chunk of the data,
  • 01:13:52move that window forward 1 by 1 by 1.
• 01:13:55And within that window,
  • 01:13:57you can construct this phase portrait or
  • 01:14:00this time delay embedding as it's called.
  • 01:14:02And then you can take that loop and you
  • 01:14:05can convert that into a persistence diagram.
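A minimal sketch of the sliding window (time delay) embedding followed by persistent homology, assuming ripser.py; the signal and window parameters are arbitrary illustration choices:

```python
import numpy as np
from ripser import ripser

def sliding_window(x, dim, tau):
    # Each row is the delay vector (x[t], x[t+tau], ..., x[t+(dim-1)*tau]).
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i * tau : i * tau + n] for i in range(dim)], axis=1)

t = np.linspace(0, 20, 600)
x = np.sin(2 * np.pi * t)              # a periodic signal

pc = sliding_window(x, dim=2, tau=10)  # the embedding traces out a loop
dgms = ripser(pc, maxdim=1)['dgms']    # expect one prominent H1 feature
```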
  • 01:14:08So I wanted to show you some
  • 01:14:11examples of kind of doing this,
  • 01:14:13right.
• 01:14:14So this is combining both cubical
• 01:14:17homology and this time delay embedding,
• 01:14:20the sliding window embedding, and using
  • 01:14:23that to compute persistence diagrams.
  • 01:14:27And so again, I'm not going to
  • 01:14:28go through all the details here,
  • 01:14:30but I just wanted to show you an
  • 01:14:32an application of this in practice.
• 01:14:34So this is a paper which was
• 01:14:36on arXiv in 2018.
• 01:14:37It might be out already.
• 01:14:39I hope it's out by now.
• 01:14:40And in this data set,
• 01:14:42they were imaging the vocal
• 01:14:44cords of humans as
• 01:14:46they were making some sounds.
  • 01:14:52And so when you're making like
  • 01:14:54a rhythmic pattern of sounds,
  • 01:14:56your vocal cords, they open, they close,
  • 01:14:58they open and then they close.
  • 01:15:00And so in this data set,
  • 01:15:03obviously there's a periodic nature to
  • 01:15:04the type of sound you're producing.
  • 01:15:07And so if you look at these
  • 01:15:09images of the vocal cords and you
  • 01:15:11compute image self similarity,
  • 01:15:12you can kind of guess that there
  • 01:15:14is a period after which the
  • 01:15:17image becomes similar to itself.
  • 01:15:18And so you can kind of quantify the
  • 01:15:21periodicity of the data in this way.
  • 01:15:23But also,
  • 01:15:23if you take the sequence of images
  • 01:15:26and you do this time delay embedding
• 01:15:29and compute cubical homology,
  • 01:15:30you can end up with a persistence
  • 01:15:33diagram where very clearly you
  • 01:15:35see this H1 feature which tells
  • 01:15:37you there's a loop in your data,
  • 01:15:39which means that the data is periodic.
  • 01:15:42What's cooler is when you do
• 01:15:45what's called biphonation.
• 01:15:46So in biphonation,
  • 01:15:47the vocal cords move in a way
  • 01:15:50that they produce two different
  • 01:15:52frequencies at the same time.
• 01:15:54So you have like a high frequency
• 01:15:57sound coming out,
  • 01:15:58and at the same time you have like
  • 01:16:01a low frequency whine coming out.
  • 01:16:03And I was thinking if I can do a
  • 01:16:05demonstration of this kind of voice,
  • 01:16:06but I really cannot do it.
  • 01:16:08So you have to.
• 01:16:09If you search online for biphonation,
  • 01:16:11you will find examples of people who
  • 01:16:13can produce both high frequencies
  • 01:16:15and low frequencies at the same time.
  • 01:16:17And if you look at the vocal cord
  • 01:16:20images of producing this kind of sound,
  • 01:16:22this is kind of what it looks like
  • 01:16:24when you look at self similarity.
  • 01:16:25You do observe a pattern here.
  • 01:16:28But very importantly,
  • 01:16:29when you take this data and you plug it
  • 01:16:32through the techniques that I've described,
  • 01:16:34you get a persistence
  • 01:16:36diagram that looks like this,
• 01:16:38which has two H1 features and one H2 feature.
• 01:16:42So remember,
• 01:16:44H1 is the dimension 1 hole and
• 01:16:48H2 is the dimension 2 hole or a void.
  • 01:16:51And so from our previous quiz,
• 01:16:53hopefully you recall that two H1 holes and
• 01:16:56one H2 hole means it's like a torus,
• 01:16:59which means there's an empty space in the
• 01:17:02middle and there are two loops in the torus.
  • 01:17:04And that makes perfect sense for
  • 01:17:06this data set because you have a
  • 01:17:09high frequency and a low frequency
  • 01:17:10forming 2 loops here.
• 01:17:12And then, because it's
• 01:17:13arranged like a torus,
• 01:17:14both of those things are happening
• 01:17:16at the same time.
• 01:17:17You get a dimension 2 hole in the
• 01:17:19data set or, as it says in here,
• 01:17:21a two-cycle in the data set.
  • 01:17:25And this again is from the same paper.
  • 01:17:27This is an example where the person is
  • 01:17:31showing irregular vocal fold vibrations.
  • 01:17:34So there is no periodicity,
  • 01:17:36no quasi periodicity.
  • 01:17:38It appears random.
  • 01:17:40When you look at image self similarity,
  • 01:17:42it just goes along the diagonal.
  • 01:17:44You don't see a lot of like important
  • 01:17:47self similarity off the diagonal,
  • 01:17:49which means that all of these images
  • 01:17:51look kind of different from each other.
  • 01:17:53And if you throw this into TDA,
  • 01:17:55you get topological features that
  • 01:17:57are very close to the diagonal.
  • 01:18:00Again,
  • 01:18:01you can compute like confidence
  • 01:18:03intervals and so forth for these things.
  • 01:18:06But again,
  • 01:18:07it shows there's no interesting
  • 01:18:09topology happening in this data
  • 01:18:11set because it's irregular.
  • 01:18:12There's no quasi periodicity
  • 01:18:14or periodicity in this data.
  • 01:18:19Lastly, topology is invertible
  • 01:18:22to a certain extent.
  • 01:18:23So folks often ask like, OK,
  • 01:18:26I created this persistence diagram.
• 01:18:28I want to interpret where these
• 01:18:31topological features come from.
  • 01:18:33And you can do that using something
  • 01:18:36called cycle representatives.
  • 01:18:37And what cycle representatives
  • 01:18:39allow you to do is they allow you to
  • 01:18:42interrogate a specific topological
  • 01:18:44feature and ask the question,
  • 01:18:46where does that feature come
  • 01:18:47from in your input data set?
  • 01:18:50So for example,
  • 01:18:51if we have this persistence diagram that's
  • 01:18:53derived from this point cloud data set,
• 01:18:55you can then interrogate this
• 01:18:57topological feature, and it
• 01:19:00will tell you that this dimension
  • 01:19:020 feature appears here because
  • 01:19:04these two clusters of data became
  • 01:19:07connected at that epsilon value.
• 01:19:09So these two connected
  • 01:19:11components disappeared,
  • 01:19:12they merged together at that epsilon value.
  • 01:19:16And likewise for dimension one,
  • 01:19:18you can interrogate that topological
  • 01:19:20feature and it will tell you
  • 01:19:22that that particular loop is
  • 01:19:24formed by these four points.
  • 01:19:26This is very,
  • 01:19:27very important for us because when we
  • 01:19:29are dealing with the state space of
  • 01:19:32cellular activity and neural activity,
• 01:19:34having access to these cycle
• 01:19:37representatives will give us the ability to
  • 01:19:41say which time points and which
  • 01:19:44parts of the brain precisely led
  • 01:19:46to the formation of a cycle which
  • 01:19:49indicates periodic activity.
  • 01:19:51So we can indeed go back in reverse
  • 01:19:54from topological features back to
  • 01:19:57our original data set and figure
  • 01:19:59out why a certain topological
  • 01:20:02feature exists in our data.
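As a sketch of what this looks like in code: ripser.py exposes representative cocycles, the dual notion, which similarly point back at the input points responsible for a feature (dedicated cycle-representative tools exist in other TDA packages). The example data here is made up:

```python
import numpy as np
from ripser import ripser

# Points on a noisy circle.
rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 100)
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((100, 2))

res = ripser(X, maxdim=1, do_cocycles=True)
h1 = res['dgms'][1]
idx = np.argmax(h1[:, 1] - h1[:, 0])   # the most persistent loop
cocycle = res['cocycles'][1][idx]      # rows: (vertex i, vertex j, value)
# The vertex indices identify which input points participate in the loop.
```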
  • 01:20:04And so I think we are at a stage here
  • 01:20:07where we don't have a lot of time left.
  • 01:20:10We're supposed to end at 5:30.
  • 01:20:12So I'm not going to go through the ML parts.
  • 01:20:14I had a few ML slides,
  • 01:20:16but I think we can punt that to
  • 01:20:18our third workshop.
  • 01:20:19I think it would be a good time
  • 01:20:21to take questions and end here.
  • 01:20:23I just wanted to mention if you're,
  • 01:20:24if you're about to leave,
  • 01:20:25we're going to have another workshop
  • 01:20:28next week that will be given by
  • 01:20:30Rahul and he'll be telling you how
  • 01:20:33we can take a graph consisting of
  • 01:20:35nodes and edges and use graph signal
  • 01:20:38processing to quantify how some
  • 01:20:40signal is distributed on that graph.
  • 01:20:43And then the following week,
  • 01:20:45me and Brian,
  • 01:20:46we're going to put all of
  • 01:20:48these things together,
  • 01:20:49talk about GSTH as a technique
  • 01:20:52all combined together and we'll
  • 01:20:54show you how we have used this
  • 01:20:56technique with some of our data sets.
  • 01:20:59Thanks for listening.
  • 01:21:00Thanks for coming.
  • 01:21:01Jay,
  • 01:21:02I have a question.
  • 01:21:03It may be a little bit premature,
• 01:21:05but is my intuition correct?
• 01:21:09Do I understand correctly that
• 01:21:11if you have more organized,
• 01:21:13more complexly organized systems,
  • 01:21:16you should see higher level holes?
  • 01:21:21And if it's mostly random noise,
  • 01:21:24you kind of don't really see much? Yeah.
  • 01:21:27So if you have random noise, then in space,
  • 01:21:30everything will get filled in, right?
  • 01:21:33It's all just noise.
  • 01:21:34So there'll be no structure to the data set
  • 01:21:37and you won't see any holes in the data.
  • 01:21:40The connectivity pattern
  • 01:21:41also looks different.
  • 01:21:42So you can do statistical tests where you
  • 01:21:45can take real data from an experiment and
  • 01:21:49compare that with topological features
  • 01:21:51derived from like standard distributions,
  • 01:21:54like uniform distribution
  • 01:21:55and Gaussian distribution,
  • 01:21:56and it will tell you that in your
  • 01:21:58experiment in the state space,
  • 01:22:00it kind of looks like a uniform distribution.
  • 01:22:02There's no structure to it. Yeah.
  • 01:22:04So that's that's one aspect.
  • 01:22:06Also like dimension zero will tell you
  • 01:22:08if in your trajectory here you have two
  • 01:22:11different connected components, right.
• 01:22:13So if you have like one set
• 01:22:15of states and then a completely
• 01:22:17different set of states and they're
• 01:22:19kind of far apart from each other,
• 01:22:21that's what we learn from
• 01:22:23dimension 0 homology, in addition to,
• 01:22:25you know,
• 01:22:26noise and how the data is distributed.
  • 01:22:28And dimension one will tell us these
  • 01:22:30periodic loop like structures that
  • 01:22:32might exist in our data and also the
  • 01:22:34empty spaces being states that cannot
• 01:22:37really exist based off of this
• 01:22:39experimental data, again being
  • 01:22:41cognizant of the earlier question where
  • 01:22:43if your data is not sampled correctly,
  • 01:22:45it might be telling you the wrong thing.
  • 01:22:49OK, thank you. Do you have any other
  • 01:22:54questions in the chat or if anybody
  • 01:22:56wants to ask any follow up questions?
  • 01:23:01All right, I don't see any questions.
  • 01:23:03Thank you so much.
  • 01:23:05Just a note, all these papers that
  • 01:23:08you mentioned in the presentation
  • 01:23:09I'm going to add to your website.
  • 01:23:12So if people are interested
  • 01:23:14in looking at those papers,
  • 01:23:16there will be links on your maps site.
  • 01:23:21And thank you again.
  • 01:23:22And I'll see you next week. See
  • 01:23:24you next week. See you. Bye.