Fantastic plots and how to draw them
October 29, 2020
Toma Tebaldi
Associate Research Scientist, Section of Hematology, Yale Cancer Center, Yale School of Medicine
YCCEH Seminar
October 15, 2020
Information
- ID
- 5825
Transcript
- 00:00The topic of today's talk
- 00:04is data visualization.
- 00:05It will be very general:
- 00:09how to design some strategies,
- 00:12some issues, some principles to guide
- 00:15the visualization of our data.
- 00:20So data visualization is important to
- 00:22explore the data and this is particularly
- 00:25crucial since nowadays data are becoming
- 00:28much more complex and much bigger,
- 00:31and so in general there
- 00:33is a rise of data science.
- 00:35So not only in research,
- 00:37not only in biological research.
- 00:40The second function of data
- 00:43visualization
- 00:45is to communicate the data, and
- 00:47that may be the most traditional one.
- 00:50So that's what
- 00:54creating publication figures,
- 00:55for example,
- 00:56is about: communicating data to others,
- 01:01because communicating data visually is
- 01:04more efficient than words in general.
- 01:09So in order to represent complex data, here
- 01:13I collected three
- 01:16general challenges and aims.
- 01:18Whenever you plot data, it is
- 01:22important that the plots and
- 01:24the representations are precise,
- 01:27so they're truthful.
- 01:28That means that distortion has
- 01:31to be avoided as much as possible.
- 01:34This is not always achievable,
- 01:36so distortion sometimes is unavoidable.
- 01:39Think about, for example,
- 01:41when you plot 2D maps
- 01:44to represent 3D data.
- 01:46But the point is that the
- 01:48distortion must not convey
- 01:49the message of the figure,
- 01:51so it has to be something that is not
- 01:54related to the main message of the figure.
- 01:57Otherwise it's a problem.
- 01:58Then the second point is clarity.
- 02:01So the figure must not be ambiguous,
- 02:06and the third one is efficiency.
- 02:09So every
- 02:11drop of ink, every pixel is precious,
- 02:14so each decision in making your plot,
- 02:17each decision on the color, on the size,
- 02:20on the number of layers that you
- 02:24plot is important.
- 02:28Everything has to have a purpose,
- 02:32so you should reduce the
- 02:34so-called chartjunk. Here,
- 02:36below the slide, you see a
- 02:38quotation from Edward
- 02:39Tufte. I discovered, by the way,
- 02:43only yesterday,
- 02:45and I have been here
- 02:48for three years and
- 02:50never knew it, that Edward Tufte,
- 02:52who is one of the
- 02:56most celebrated visualization
- 02:58experts, is there in New Haven.
- 03:03And the quotation is that with an
- 03:05image you have to give the viewer
- 03:07the greatest number of ideas in the
- 03:09shortest time and with the least possible
- 03:11ink.
- 03:13This is another general representation
- 03:15of what you should consider
- 03:18to make a good visualization.
- 03:20Also, this is very general, so
- 03:23it's not only about science,
- 03:25and basically the criteria are
- 03:27organized in four different sets.
- 03:29So you need to represent the information,
- 03:32but the figure also needs
- 03:34to convey, to communicate a story, and
- 03:37that's the concept of the figure.
- 03:40This is connected also with the goal
- 03:43of the figure, so the function:
- 03:47what is the message that you
- 03:50want to display. And also the visual
- 03:53form is important.
- 03:55Obviously the weight of these four
- 03:58different layers is different in
- 04:00different applications for images,
- 04:02so visual form probably is more
- 04:05important for artistic displays,
- 04:07while for scientific displays
- 04:09probably information, goal, and
- 04:12story are more important.
- 04:14This doesn't mean that you should not
- 04:17also consider the visual part.
- 04:20Ideally, the perfect visualization
- 04:22is at the center of these four sets.
- 04:28So this is it for the introduction. Now,
- 04:31the rest of the presentation will be
- 04:33structured with some very concrete examples,
- 04:35and it's also organized in
- 04:37a way that is interactive,
- 04:39so I will show
- 04:41some examples of figures and I will
- 04:44ask you what could be wrong with each figure.
- 04:48Starting with this:
- 04:49this is a figure that is very
- 04:52frequent in scientific publications.
- 04:55It's a bar plot, and
- 04:58it's actually the most frequent image
- 05:01that you can find in biomedical journals.
- 05:08Do you have any ideas of what
- 05:11could be wrong with this?
- 05:13Not pretty, yes. Lots of it.
- 05:18It. It's lacking data.
- 05:21It's it's not showing you
- 05:23the data distribution.
- 05:24Yeah, yes, exactly there are.
- 05:26It's putting the treatment on the left,
- 05:29which I always don't like.
- 05:31I always want the control on the
- 05:33left. Oh yes, OK.
- 05:35Yes, that's true, yes,
- 05:36so it has a lot of
- 05:42visual problems, but, yes,
- 05:44the main thing is that it doesn't
- 05:47show the data. So
- 05:49that's the main drawback of this image,
- 05:52and particularly in the last years
- 05:54that's a trend: a lot of
- 05:57publishers are requesting this in images.
- 05:59So the principle is that you need ideally to
- 06:02always show the data points in every figure,
- 06:05because you should show the data that
- 06:08make up your figures, and this for a
- 06:11bar plot means that you have to show
- 06:13the data points.
- 06:16So here you see an example of how this
- 06:19bar plot can be represented with the
- 06:22data points, and you see here
- 06:25on the right the single
- 06:28data points and also a summary
- 06:30statistic that could be, for example,
- 06:33the mean plus or minus the standard deviation
- 06:35for the treatment and the control.
- 06:38In general, showing bar plots
- 06:40with only the mean and the standard
- 06:42deviation is a problem, and there was
- 06:45a publication five years ago
- 06:47that discussed why. One of the issues is,
- 06:49for example,
- 06:50that different data distributions
- 06:52can lead to the same bar plot.
- 06:55You see an example here.
- 06:57So in A
- 06:59you see a bar plot representation
- 07:01of a distribution of data, and all the
- 07:04distributions that you see from B to E
- 07:06could be represented
- 07:08by that bar plot.
- 07:09So the ideal situation would be
- 07:11what you see here in plot
- 07:14B, where you have data that are
- 07:18symmetrically distributed.
- 07:19So if the distribution
- 07:21of your real data is
- 07:23like this, the bar plot is less problematic.
- 07:26But for example in C you see a situation
- 07:28where you have an outlier, and so
- 07:31for example this would mean that the
- 07:33supposed difference that you are
- 07:35showing in the bar plot is not real,
- 07:38but it's present only because
- 07:39you have this outlier point,
- 07:41while most of the other data are
- 07:44overlapping in the two distributions.
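The point that different distributions can produce the same bar can be checked numerically. The following is a minimal sketch, not taken from the talk or from the paper it cites: the three data sets and the `rescale` helper are made up for illustration, and each set is rescaled to a mean of 10 and a standard deviation of 2 so that a mean ± SD bar plot would look identical for all of them.

```python
import statistics

def rescale(values, target_mean, target_sd):
    """Shift and stretch values so they have exactly the requested mean and sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [target_mean + target_sd * (v - mean) / sd for v in values]

# Hypothetical raw shapes (illustrative numbers only).
raw_shapes = {
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
    "one outlier": [1, 1, 1, 1, 1, 1, 10],
    "bimodal": [1, 1, 1, 9, 9, 9],
}

# Force every group to the same mean and SD, as a bar with error bars would show.
groups = {name: rescale(values, 10.0, 2.0) for name, values in raw_shapes.items()}

summaries = {
    name: {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
    }
    for name, values in groups.items()
}

for name, s in summaries.items():
    print(f"{name:12s} n={s['n']} mean={s['mean']:.2f} sd={s['sd']:.2f} median={s['median']:.2f}")
```

The mean and SD columns agree across the three groups, while the sample size and the median do not, which is the information a bar with an error bar cannot show.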
- 07:46Sometimes, as you see in D,
- 07:49this plot could hide some patterns in
- 07:52the data, so that's what you see here.
- 07:55Do you see my cursor?
- 07:59Yes, OK,
- 08:00so this for example shows that
- 08:05the distributions that you see
- 08:07here are bimodal.
- 08:08This could be linked, for example,
- 08:11to replicates, for example
- 08:12technical replicates and
- 08:13biological replicates,
- 08:14or it could be an important
- 08:16property of the data.
- 08:18Nevertheless,
- 08:18it's something that you cannot
- 08:20see if you represent the data with
- 08:23a bar plot. And bar plots also
- 08:25hide the number of data points that are
- 08:27used to build
- 08:29the bar plots themselves, and so
- 08:31for example in E you see a situation
- 08:33where you have an unequal number of
- 08:35points for the black and the white
- 08:39bars on the left and the right.
- 08:42This is a problem also when you
- 08:45want to show paired data in bar plots.
- 08:49So again here you see a situation
- 08:53where a bar plot
- 08:54displays the same
- 08:57picture for the situations that you see
- 08:59displayed in B, C and D. So B, C and D are
- 09:01very different situations. Here
- 09:03you could imagine, for example,
- 09:04that these data are obtained from single
- 09:06patients treated with a drug,
- 09:09and you measure a parameter of the patients,
- 09:11and so the information related to each
- 09:14patient has to be connected: that is the
- 09:17meaning of the paired plot.
- 09:20So the situation in B shows that the
- 09:22drug has a consistent effect on all
- 09:25the patients, and you can see that
- 09:28calculating for each patient
- 09:30the difference between the dots on
- 09:32the left and on the right gives rise
- 09:35to this plot here below,
- 09:38where all the differences are
- 09:40positive and are also consistent. In
- 09:42C you see a situation where
- 09:44the drug has
- 09:46very different effects depending
- 09:48on the patient,
- 09:50so that the distribution of
- 09:52the differences is skewed.
- 09:53And by the way,
- 09:55this line represents the median
- 09:57difference that you see across
- 09:59patients for the treatment.
- 10:01And the third plot, D,
- 10:03has a composition of effects.
- 10:05So here you see that again
- 10:08the difference is bimodal.
- 10:09That means that there are patients
- 10:11that do not respond to the drug,
- 10:14as you see here with the
- 10:15horizontal lines, and some patients
- 10:17that respond to the drug.
- 10:19So the resulting distribution of the
- 10:21differences, as you see here, is bimodal.
- 10:24The problem with bar plots,
- 10:25and the problem
- 10:26also
- 10:27if you use bar plots with paired
- 10:29data, is that you don't see
- 10:31any of this structure, so
- 10:33you are losing it.
- 10:34So the best way is always to show
- 10:36the dots of your distribution,
- 10:38maybe together with the bar plots,
- 10:40and if the data are paired, also
- 10:42to show the single connections
- 10:44between the dots.
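The paired-data argument can be illustrated with a small sketch. The patient values below are invented for illustration and do not come from the slides; the two hypothetical treatments give the same group mean after treatment, so the bars would have the same height, but the per-patient differences tell two different stories.

```python
import statistics

def paired_differences(before, after):
    """Per-patient change (after minus before); both lists must follow the same patient order."""
    if len(before) != len(after):
        raise ValueError("paired data need one 'before' and one 'after' value per patient")
    return [a - b for b, a in zip(before, after)]

# Hypothetical measurements for six patients (illustrative numbers only).
before = [10, 12, 14, 16, 18, 20]
consistent_after = [13, 15, 17, 19, 21, 23]  # every patient changes by +3
mixed_after = [10, 12, 14, 22, 24, 26]       # three non-responders, three responders

consistent_diff = paired_differences(before, consistent_after)
mixed_diff = paired_differences(before, mixed_after)

print("mean after treatment:", statistics.mean(consistent_after), statistics.mean(mixed_after))
print("consistent differences:", consistent_diff)
print("mixed differences:     ", mixed_diff)
```

Plotting the differences, or connecting each patient's two dots with a line, is what makes the responder and non-responder groups visible.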
- 10:49There is also an issue about the
- 10:52choice of displaying the mean of your data,
- 10:56for example, versus the median,
- 10:58or of showing the standard deviation
- 11:01versus the standard error of the mean.
- 11:04So mean and median are ways to
- 11:07summarize the centrality of a distribution.
- 11:11The mean is preferable if you
- 11:14suppose your data are, for example,
- 11:17symmetrically distributed,
- 11:18for example if you assume that the data
- 11:21have a normal or Gaussian distribution,
- 11:23while the median
- 11:25is the point that represents
- 11:28the middle of your data,
- 11:30the middle of your distribution,
- 11:32and it's more generally applicable,
- 11:34independently of the shape
- 11:36of the distribution of the data.
- 11:38So here you see an example where you
- 11:41have four different sample populations
- 11:44and you plot the mean plus the standard deviation.
- 11:48And that's the most conventional
- 11:50way that you see in publications:
- 11:52the mean plus standard deviation.
- 11:55And for the median of the population, you
- 11:58see the single points, and the
- 12:00horizontal bar represents the median.
- 12:02So an important point about
- 12:06mean versus median is that the
- 12:08mean should be used only with
- 12:11symmetrical distributions.
- 12:13Otherwise it can be misleading,
- 12:15while the median is more
- 12:17generally appropriate.
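A quick numerical check of the mean versus median point, using made-up measurements (the numbers are an assumption for illustration, not data from the talk): adding one large outlier moves the mean a lot and leaves the median where it was.

```python
import statistics

# Hypothetical replicate measurements (illustrative numbers only).
replicates = [4, 5, 5, 6, 5]
with_outlier = replicates + [50]  # the same data plus one extreme value

mean_shift = statistics.mean(with_outlier) - statistics.mean(replicates)
median_shift = statistics.median(with_outlier) - statistics.median(replicates)

print("mean:  ", statistics.mean(replicates), "->", statistics.mean(with_outlier))
print("median:", statistics.median(replicates), "->", statistics.median(with_outlier))
```

In this example the median does not move at all; in general it can move a little, but the size of that move does not grow with the magnitude of the outlier.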
- 12:19When you have an outlier like that,
- 12:21would you always recommend the median?
- 12:24When you have an outlier,
- 12:27like in the third group there,
- 12:29yeah, then it makes more sense to
- 12:31use the median.
- 12:33Yeah, the point
- 12:34is that the median is more robust
- 12:37with outliers; it is
- 12:38more robust with outliers,
- 12:40and the mean is not,
- 12:42so the presence of an outlier, as
- 12:44you see here in C, can shift the mean a lot,
- 12:48while the median is affected,
- 12:50but not so much,
- 12:53especially by the magnitude
- 12:55of the outlier,
- 12:56I would say. So
- 12:58Toma, a question, right?
- 12:59So also, you know,
- 13:02knowing there's a difference
- 13:03between mean and median,
- 13:05one of the things I heard,
- 13:07of course I haven't looked
- 13:09into this deeply enough myself,
- 13:11is that for the median the distribution,
- 13:13unlike for the mean, does not necessarily follow
- 13:15a Gaussian or normal distribution,
- 13:17so that from a statistical point of view
- 13:20it is going to be a little hard to calculate
- 13:23certain significance, etc.,
- 13:24based on median data.
- 13:26Is that true?
- 13:27Or is it simply a misnomer?
- 13:30How to calculate the
- 13:32significance of differences?
- 13:34That's a different matter.
- 13:36So that's a difference of approach.
- 13:39If you choose parametric tests,
- 13:41such as the t-test or the ANOVA,
- 13:44those tests assume that the
- 13:47distribution is Gaussian, is normal.
- 13:49Yeah, so you need to be careful.
- 13:52Usually, if it is a repeated measure,
- 13:55so if you're testing a repeated
- 13:56measure, you assume the error
- 13:59is distributed in a Gaussian way,
- 14:01but that is not always the case,
- 14:03for example
- 14:04if you're comparing two populations of
- 14:07genes with a signal for each gene.
- 14:09You just have to check it.
- 14:11Well, so this is something that I
- 14:13think will be particularly important
- 14:15for experimental scientists, right?
- 14:17Because, you know, as experimentalists,
- 14:18when we are trained, we know, OK,
- 14:21when we design an experiment,
- 14:22we do several replicas so we can
- 14:24draw an error bar, without thinking
- 14:26why and how to deal with it.
- 14:28And if you go to a statistician,
- 14:30they will tell you, oh look,
- 14:32if you're going to use the t-test,
- 14:35you have to show me first that this is
- 14:37actually largely a normal distribution
- 14:39before you can actually use the t-test.
- 14:42Whereas for the vast majority
- 14:43of people in the lab,
- 14:45that's not how they will
- 14:46think about it in the first place,
- 14:49and they are also not trained enough
- 14:51to think, you know, how to prove
- 14:53or disprove that that's the case.
- 14:55So what would you suggest,
- 14:57especially when we're doing
- 14:58experiments where you cannot do
- 15:00200 replicas for each experiment?
- 15:02So what would be a good
- 15:04approach in that regard?
- 15:06Yeah, so there is a tradeoff between
- 15:09the ideal situation, where the ideal
- 15:11situation would be always to have
- 15:14enough data points so that you can
- 15:16understand the shape of the distribution,
- 15:19and the real-case scenario, where you
- 15:22can only do as many replicates as you can,
- 15:24and so usually you have to assume
- 15:27that the distribution is normal.
- 15:33Ideally, you should always check.
- 15:36And again, if you are repeating measures
- 15:39and you are collecting a measure
- 15:41of the same quantity in a repeated
- 15:44way, you can assume that
- 15:46if the error is stochastic,
- 15:48it should be normally distributed.
- 15:50So you assume that the distribution of
- 15:52the error is Gaussian, and it makes sense.
- 15:55But for example in other situations,
- 15:57where you have a lot of measurements
- 16:00and measurements of different entities,
- 16:02for example the expression of
- 16:04different genes, when you are dealing with
- 16:07a population of genes,
- 16:08then this assumption is less probable,
- 16:11is less likely,
- 16:12and you should have enough data points
- 16:15so that you can switch from parametric
- 16:18tests to nonparametric ones, so tests
- 16:23that don't make assumptions
- 16:25on the underlying distribution.
- 16:26These are, for example,
- 16:27the Wilcoxon test or the
- 16:30Mann-Whitney test.
- 16:31And the problem is that you need more
- 16:34replicates, because if n,
- 16:36the size, is less than five,
- 16:39you cannot reach the statistical
- 16:43significance as it is accepted, below 0.05,
- 16:47but it's usually the more correct way.
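The remark about small sample sizes follows from how rank tests work: with few observations there are only a few possible rankings, so the exact p-value cannot go below a floor. The sketch below computes that floor under stated assumptions (exact two-sided tests, no tied values); the function names are made up for illustration, and the exact cutoff depends on which test and design is used.

```python
from math import comb

def min_p_mann_whitney(n1, n2):
    """Smallest exact two-sided p-value for two independent groups without ties.

    Under the null hypothesis all C(n1 + n2, n1) assignments of ranks to the first
    group are equally likely, and only the two most extreme ones give the minimum.
    """
    return min(1.0, 2 / comb(n1 + n2, n1))

def min_p_wilcoxon_signed_rank(n_pairs):
    """Smallest exact two-sided p-value for paired data without ties or zero differences.

    Each difference is positive or negative with equal probability under the null,
    so there are 2 ** n_pairs sign patterns and two of them are the most extreme.
    """
    return min(1.0, 2 / 2 ** n_pairs)

for n in range(3, 8):
    print(
        f"n={n}: Mann-Whitney (n vs n) >= {min_p_mann_whitney(n, n):.4f}, "
        f"signed-rank (n pairs) >= {min_p_wilcoxon_signed_rank(n):.4f}"
    )
```

Under these assumptions, three replicates per group can never give p below 0.1 with the Mann-Whitney test, and five pairs can never give p below 0.0625 with the signed-rank test, however large the effect is.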
- 16:53But the standard is not to use that,
- 16:55and so I remember there was a case
- 16:57where the paper was in review.
- 16:59It was from.
- 17:02Young bean and I remember we performed
- 17:05the Wilcoxon test and the reviewers
- 17:07asked why we didn't do the parametric
- 17:10test, so they asked for the opposite.
- 17:13They asked us to go against the
- 17:15ideal situation.
- 17:18I think this is very helpful.
- 17:20I think it's really,
- 17:21you know, telling,
- 17:22especially for people who
- 17:24are not familiar with the
- 17:26t-test and also the Wilcoxon test.
- 17:29I'd really suggest
- 17:30you look into that.
- 17:32These things can be very helpful.
- 17:33Yeah, and obviously
- 17:35it's important when you plan,
- 17:37if you can, to have enough data
- 17:40points to perform a nonparametric test.
- 17:43In the high-throughput
- 17:44experiments that we see now,
- 17:45for example single cell, that's
- 17:47not a problem anymore, because
- 17:49you usually have a lot of data
- 17:51points, and so that's less of a
- 17:53problem than it sometimes was,
- 17:55because
- 17:58the amount of data is increasing.
- 18:00And just
- 18:01a generic comment. A lot
- 18:04of our group are blood,
- 18:10hematology researchers. Yeah,
- 18:11and neither Blood nor Blood Advances requires
- 18:16the investigators in their papers
- 18:17to show all the
- 18:19data points. And now
- 18:21I'm on the publication committee.
- 18:22We've actually talked about this,
- 18:24but we go by the Journal of Cell Biology
- 18:28instructions to authors and prep for figures,
- 18:30and they're a Rockefeller Press
- 18:32publication. And they
- 18:34haven't. So they have genome research
- 18:36and germ cell bio Med and stuff,
- 18:38so they haven't come around to making
- 18:41people show all their dots etc.
- 18:43But a number of journals, as you know,
- 18:46half like JC I you know JC AI
- 18:50advances etc. They might not
- 18:51even review your paper if
- 18:53you show, for instance,
- 18:54your plots on the left here.
- 18:56Well, they might, you know,
- 18:57might not even go out
- 18:59for review. The pre-reviewers will say,
- 19:01you know, your figures are inadequate
- 19:03for our instructions to authors, etc.
- 19:05So I think some journals are
- 19:07coming around to this is the way
- 19:09we really want to see the data.
- 19:12Yeah, I think there is a shift
- 19:15in the paradigm, let's say,
- 19:16and it will take years.
- 19:18But for example, I have a slide here,
- 19:21and this is from my experience,
- 19:24showing that, for example,
- 19:25all the family of the Nature journals
- 19:28already have these policies for the figures.
- 19:31So this is something I received after
- 19:34the review of a paper as editorial
- 19:38guidelines, and because of these
- 19:41policies I had to change a
- 19:44lot of figures, and you see that
- 19:47one of the
- 19:49policies, as you see here,
- 19:50the last one, is that for sample
- 19:52sizes that are less than 10
- 19:54they want you to plot
- 19:57the individual data points, and
- 19:58so they don't accept
- 20:00bar graphs anymore.
- 20:03And then, for example,
- 20:05if you have some statistics such as
- 20:08error bars with fewer than 3 replicates,
- 20:11you have to
- 20:13remove them and you have
- 20:15to show the data without the
- 20:18statistics, without the error bars.
- 20:20Then this also is a point that
- 20:23usually is not satisfied.
- 20:25So when you plot some statistical
- 20:28significance values,
- 20:29they don't accept
- 20:31the stars anymore,
- 20:33but you have to provide the
- 20:35precise P value in the figure.
- 20:38It means that if you have some stars,
- 20:40you have to change the stars,
- 20:42converting each star to the precise P
- 20:45value before publishing, and
- 20:47then also you have to provide the
- 20:49precise sample size for each of your bars.
- 20:52For example,
- 20:53I mean,
- 20:53I think in the past it was enough
- 20:56to provide a range, like from
- 20:58three to six replicates,
- 21:00but now they really want the number
- 21:04for each population,
- 21:05for each sample that you have.
- 21:08So these,
- 21:10in my experience, were something
- 21:12that I had to provide,
- 21:15but after the review, so it was
- 21:18at the editorial
- 21:22stage of acceptance of the paper,
- 21:24and I think this is true now for all
- 21:28the families of the Nature
- 21:32journals.
- 21:34Can I add something?
- 21:36Although this is only for
- 21:38the goal of publication,
- 21:39it's important that we start
- 21:42practicing all these rules in
- 21:44our daily life, because it's so
- 21:46painful when you have to do this
- 21:49while you're trying to get
- 21:51the figures into the journal.
- 21:53It's a lot easier to do it while you're
- 21:56making the figures in real life.
- 21:59Yeah, so obviously it
- 22:01saves work to do it before.
- 22:02Yeah, it saves work, because otherwise
- 22:05you have to redo all the figures. So
- 22:10yeah, I also echo that,
- 22:11and also just want to say that, you know,
- 22:14I used to just use Excel to plot things.
- 22:17But since many of my lab members
- 22:19started to use GraphPad Prism to plot,
- 22:22that makes a huge difference in
- 22:24converting between different types
- 22:25of plots, such as this kind of thing.
- 22:28If you have a bar graph
- 22:30in that software,
- 22:31then you can very easily change that to a
- 22:34bar graph with the individual dots distributed.
- 22:36So it's very easy to work with.
- 22:40Yeah, on that I also have something
- 22:42at the end of the presentation.
- 22:44So basically there are a lot of tools now,
- 22:46more or less commercial, that
- 22:49are readily available
- 22:51and that can switch between many different formats,
- 22:55starting with the same initial data,
- 22:58basically formatted as a table,
- 23:01so that from the same table you can switch
- 23:03to many different visualizations.
- 23:06So that's true,
- 23:08and it's probably easier now to plot
- 23:11single dots than it was in the past.
- 23:15Without respect.
- 23:18OK, so that was the main point of this part.
- 23:22I had a part on the standard
- 23:24deviation and standard error.
- 23:26That's another issue, because the
- 23:28standard error is basically the
- 23:29standard deviation divided by the square
- 23:32root of the number of experiments,
- 23:34and usually the standard
- 23:36error is displayed.
- 23:37But you have to be careful that it's
- 23:40a measure that tends to go to zero
- 23:43just because you increase the number
- 23:45of replicates or the number of points.
- 23:48So you see an example here where
- 23:50it seems, by plotting the standard
- 23:52error, that the black bar and the
- 23:55white bar have the same
- 23:58spread of the data.
- 23:59But if you look at the standard
- 24:02deviation you see that this is
- 24:04an effect of the fact that
- 24:06the black bar has higher spread,
- 24:08but also more points,
- 24:10and that's why the standard
- 24:12error seems the same.
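The relationship described here, standard error equals standard deviation divided by the square root of n, can be sketched in a few lines. The numbers for the "white" and "black" bars are hypothetical values chosen for illustration, not read from the slide.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: the standard deviation divided by sqrt(n)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return sd / math.sqrt(n)

# Hypothetical groups: the black bar has twice the spread but four times the points.
white_sem = standard_error(sd=2.0, n=4)
black_sem = standard_error(sd=4.0, n=16)
print("white SEM:", white_sem, "black SEM:", black_sem)

# With a fixed spread, the standard error shrinks as replicates are added.
for n in (4, 16, 64, 256):
    print(n, standard_error(sd=2.0, n=n))
```

Both groups end up with the same standard error although their spreads differ by a factor of two, which is why reporting n and the standard deviation alongside matters.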
- 24:16So that's another issue.
- 24:18Obviously for publication the
- 24:20standard error of the mean is often preferred,
- 24:23because it usually gives an impression
- 24:26of the data being less spread out.
- 24:30But especially with different
- 24:31numbers of samples
- 24:33in different bars,
- 24:35this could be misleading.
- 24:40And all these issues were presented
- 24:42in this paper published
- 24:44five years ago in PLOS Biology.
- 24:49I would skip this,
- 24:50since we will touch on this later,
- 24:52but an alternative solution, if
- 24:54you have enough data points,
- 24:56I would say more than 10,
- 24:59an alternative
- 25:01to showing bar plots
- 25:03is to show the distribution of
- 25:05the data as a box-and-whisker plot,
- 25:07as you see here. I have some slides
- 25:11later with more details on this.
- 25:14OK, so this is the next example.
- 25:17It's a pie chart with the
- 25:19usage of different browsers,
- 25:22so this is extra science image so.
- 25:26This is a classic example
- 25:29in visualization lessons.
- 25:31So what could be wrong with this?
- 25:39There's no n. Yeah, so that's a yes,
- 25:44there is no n, so
- 25:47you don't know how many
- 25:50data points were used in
- 25:52order to build the frequencies.
- 25:55Obviously pie charts are used to display
- 25:59frequencies and proportions of some
- 26:02classes that sum up to 100 or to one.
- 26:05The main problem is that,
- 26:07so the idea is that
- 26:10you should avoid pie charts.
- 26:12So displaying
- 26:15information on a proportion or
- 26:17a percentage as a pie chart is
- 26:21not the best choice,
- 26:23because it was shown that humans
- 26:27are very bad at reading angles,
- 26:30so we're not very
- 26:32precise in understanding differences between
- 26:35angles, and so between the sizes of the
- 26:39slices of the pie, and so usually if you
- 26:44convert the pie chart into a bar plot,
- 26:47the information is much clearer.
- 26:49It's true that the pie
- 26:51chart is more aesthetically
- 26:53appealing, but the bar plot
- 26:55is, in any circumstance,
- 26:57usually more effective in displaying,
- 26:59for example, differences in the
- 27:02usage of these genome browsers.
- 27:04So this has been a long-standing issue, and
- 27:07in many presentations there is always
- 27:10this suggestion to avoid pie charts altogether.
- 27:13There are also some examples of this.
- 27:15So these are three pie charts and you can see
- 27:19that they are different from each other,
- 27:22but it's very difficult to
- 27:24understand the difference.
- 27:25So the difference is in the size
- 27:28of the slices of the three pies,
- 27:31but it's very difficult,
- 27:32for example, to understand in each
- 27:35pie which one is the largest slice
- 27:38and to draw comparisons. It is much
- 27:41easier to understand these issues,
- 27:44so which slice is larger, if the information is
- 27:48not displayed as pie charts,
- 27:51but as bar charts.
- 27:54So that's on the web.
- 27:56I also found these provocative.
- 27:59Label of pie charts as lighters.
- 28:03So in general it would be better to avoid
- 28:06displaying information as pie charts
- 28:08and prefer a bar chart instead
- 28:10to show the same information.
- 28:14OK, so that was faster.
- 28:16This is another example.
- 28:18What could be wrong with this plot?
- 28:20Again, we have a treatment.
- 28:21We have a control.
- 28:22This time we see the data points.
- 28:26The scale is wrong, so it hides
- 28:29the distribution at the lower end.
- 28:31Yes, exactly, so this is a case where
- 28:34most of the data are compressed,
- 28:37since they have very different magnitudes.
- 28:39Most of the data are compressed
- 28:42in a very small part of the
- 28:46plot and we cannot understand
- 28:48very well how they are distributed,
- 28:51because most of the plot is devoted
- 28:53to these two outliers.
- 28:56So this is an issue with measures
- 28:58that have different magnitudes.
- 29:00In my experience it happens,
- 29:03for example, in gene expression measurements,
- 29:07because they can vary,
- 29:08especially with sequencing,
- 29:10over four to five
- 29:13orders of magnitude, and the main way to solve this
- 29:16issue is to log transform the data.
- 29:19So instead of plotting on a
- 29:21linear scale, you log-transform
- 29:24the scale of the data, and this
- 29:26allows you to reduce the distance
- 29:28between these two points,
- 29:30the outliers,
- 29:31and allows you to see also the
- 29:34distribution of the points
- 29:36that here seem all compressed.
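To make the compression effect concrete, the sketch below measures how much of the axis the bulk of the data occupies before and after a log10 transform. The expression values and the helper names are invented for illustration; real data with zeros would need extra handling, such as adding a pseudocount, because the logarithm is only defined for positive numbers.

```python
import math

def log10_transform(values):
    """Return log10 of each value; the logarithm is only defined for positive numbers."""
    if any(v <= 0 for v in values):
        raise ValueError("log transform needs strictly positive values")
    return [math.log10(v) for v in values]

def bulk_fraction(values, n_outliers):
    """Fraction of the axis range covered by everything except the n_outliers largest values."""
    ordered = sorted(values)
    bulk = ordered[:-n_outliers]
    return (bulk[-1] - bulk[0]) / (ordered[-1] - ordered[0])

# Hypothetical measurements spanning five orders of magnitude (illustrative only).
expression = [1, 2, 5, 10, 20, 10_000, 100_000]

linear_fraction = bulk_fraction(expression, n_outliers=2)
log_fraction = bulk_fraction(log10_transform(expression), n_outliers=2)

print(f"bulk of the data uses {linear_fraction:.4%} of a linear axis")
print(f"bulk of the data uses {log_fraction:.1%} of a log10 axis")
```

On the linear axis the five smaller values share a tiny sliver of the plot; on the log axis they take up roughly a quarter of it, so their distribution becomes readable.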
- 29:40So usually log transformation allows you to
- 29:43capture some information on the differences
- 29:45between your points more clearly,
- 29:48not in all cases, but in some
- 29:50cases, than displaying the
- 29:52information on a linear scale,
- 29:55especially when you have a large range
- 29:58between your minimal and maximal
- 30:00measurements. An alternative way
- 30:04is to use panel breaks.
- 30:07Personally I prefer log
- 30:10transformation over panel breaks, because
- 30:13it is mathematically more
- 30:16elegant,
- 30:18but there are situations where
- 30:20you can choose. So this is an example.
- 30:24You have a bar chart.
- 30:26You have a huge difference between
- 30:28the measurements of A to D and E and F.
- 30:31So this is how you solve the problem by
- 30:35introducing a break in your panel,
- 30:38from 25 to 200 to 210, and this is the
- 30:41equivalent solution by log transformation.
- 30:44As you see,
- 30:46the two solutions
- 30:49give a similar result.
- 30:51But here you insert a manual break in
- 30:54the data, and this could be misleading.
- 30:57Here you solve the issue by log
- 31:00transforming all the measurements.
- 31:02So this, for example, is an advantage,
- 31:04because it affects all the measurements,
- 31:07while this panel break
- 31:09affects only, for example,
- 31:11these two bars and could distort the data.
- 31:18Another scenario where you should
- 31:22consider log transformation is this.
- 31:26This could be a plot that shows gene
- 31:30expression levels from RNA-seq for a
- 31:33population of genes.
- 31:35So each gene could be a dot,
- 31:37and you see that
- 31:39there is a different range,
- 31:41but you see that there are outliers, like
- 31:44for example genes of ribosomal proteins;
- 31:46histones usually are
- 31:49in this part of the plot,
- 31:52but most of the genes, 90% of
- 31:54your genes, are in this part of the
- 31:56plot and you cannot really see them.
- 32:01You cannot really inspect them,
- 32:03because most of the plot is
- 32:05dedicated to some outliers.
- 32:07So again, here is a situation
- 32:09where you can log transform
- 32:11both coordinates.
- 32:12So let's say that here is the
- 32:15control and this is the treatment,
- 32:17and this will allow you to see in more
- 32:20detail the differences in the
- 32:22expression of the bulk of your genes.
- 32:28In a situation like this,
- 32:30you should also consider
- 32:32this: if you're interested,
- 32:34for example, in showing differences in
- 32:37expression between a control,
- 32:39for example R1, and
- 32:41the treatment, R2,
- 32:43you also have the possibility to show,
- 32:48as the Y axis,
- 32:50the differences in the log values.
- 32:52So this representation
- 32:53here is the same as this,
- 32:56but it maximizes the visualization
- 32:58of the differences in the
- 33:00expression levels of genes.
- 33:01So this is something that you
- 33:04find called an MA plot.
- 33:07It was introduced with the
- 33:09analysis of microarray data,
- 33:11but you can find it also
- 33:13with sequencing data.
- 33:15Sometimes these two different
- 33:16visualizations are used,
- 33:17depending on the aim of the figure,
- 33:20so sometimes you will find this one,
- 33:21especially when the message of the
- 33:23figure is that you don't see big
- 33:25differences between the two conditions,
- 33:27while if the message is that you find big
- 33:29differences between the two conditions,
- 33:31you will mostly find this visualization.
- 33:35So here I would just point out that
- 33:37in any sequencing experiment,
- 33:40you will probably never find any gene
- 33:44that is in this area, because,
- 33:48for most of the genes,
- 33:50the main difference,
- 33:51the main determinant,
- 33:53is the basal expression level.
- 33:55So usually your perturbations do not
- 33:57affect so much the expression of a gene,
- 34:00so that the gene is in these
- 34:03area of the plot of the oranges.
- 34:06And that's why this visualization
- 34:08is much more efficient in capturing
- 34:10the expression differences,
- 34:12because they scale on the
- 34:15expression at baseline.
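The MA transformation discussed here can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the common convention in which M (the Y axis) is the difference of the log2 values and A (the X axis) is their average. The function name `ma_values`, the pseudocount, and the example numbers are hypothetical and not taken from the talk.

```python
import math

def ma_values(control, treatment, pseudocount=1.0):
    """Return one (A, M) pair per gene for two expression vectors.

    M = log2(treatment) - log2(control)  -> Y axis (difference of the logs)
    A = mean of the two log2 values      -> X axis (average expression)
    The pseudocount is an assumption added here so that zeros can be logged.
    """
    points = []
    for c, t in zip(control, treatment):
        log_c = math.log2(c + pseudocount)
        log_t = math.log2(t + pseudocount)
        points.append(((log_c + log_t) / 2, log_t - log_c))
    return points

# Hypothetical expression values for four genes (control R1, treatment R2).
r1 = [3, 15, 255, 4095]
r2 = [7, 15, 127, 4095]
for a, m in ma_values(r1, r2):
    print(f"A = {a:.2f}  M = {m:+.2f}")
```

Genes whose expression does not change sit on the horizontal line M = 0 whatever their baseline level, which is why differences are easier to read than on a control-versus-treatment scatter plot.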
- 34:19OK, so now I have a section I don't
- 34:21know that I'm I have a section
- 34:24about how to display distributions.
- 34:29So let's say that
- 34:29we have a display. One time you had 15
- 34:32minutes and if we go a little over, that's
- 34:34OK, OK? So when you have to
- 34:37represent the distribution of data,
- 34:39you have many choices.
- 34:40The histogram is one of the most used choices.
- 34:44It has the advantage that it can present,
- 34:48with detail, the shape of the
- 34:51distribution of your data.
- 34:53And so basically you have a variable of
- 34:56interest that usually is a continuous
- 34:58variable and you want to show
- 35:01how this variable is distributed.
- 35:03So you divide the range of the values into
- 35:06some bins and then you count the number
- 35:09of points that fall inside each bin.
- 35:12The issue with the histograms is
- 35:14that you should be careful when
- 35:17building the histograms and when looking
- 35:19at the histograms, because there are
- 35:22some arbitrary parameters
- 35:24in building a histogram,
- 35:26mainly the choice of the bin size.
- 35:30So this is an example where the same
- 35:33distribution of data, that is the
- 35:35distribution of the price of Airbnb
- 35:38apartments in a French city, has been
- 35:40binned in two different ways.
- 35:43So here is the price, and here the bin sizes.
- 35:47So the size of each bin is 10
- 35:52dollars,
- 35:53while here on the right it is
- 35:57$2. So you can see that using more
- 36:00granular bins allows you to see
- 36:04the presence of some accumulations
- 36:06in your data that you cannot really
- 36:09see with the larger bin size,
- 36:11and this could be important because
- 36:14these accumulations probably
- 36:16are accumulations of prices that are
- 36:19due to the fact that they are prices
- 36:22that are commonly used
- 36:23by many different Airbnbs, for example,
- 36:26because they are multiples of 50 or 100,
- 36:29for example.
- 36:30But the fact is that depending on
- 36:33the choice of the bin,
- 36:34you see a different story.
- 36:38And so you should always be
- 36:42careful to select a bin size
- 36:45that doesn't affect the data too much.
- 36:49There are also software tools
- 36:51that calculate, depending on
- 36:53your data, depending on where
- 36:56your points are placed, the best
- 36:58size of the bins, so that you
- 37:02reduce the distortion of your data.
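To make the effect of the bin size concrete, here is a small sketch in plain Python that bins the same values with two widths and then computes one data-driven width. The talk does not say which rule the software tools use; the Freedman-Diaconis rule shown here is one common choice and is an assumption, as are the function names and the example prices.

```python
import math
import statistics

def histogram(values, bin_width):
    """Count how many values fall in each bin [k*w, (k+1)*w)."""
    counts = {}
    for v in values:
        left_edge = math.floor(v / bin_width) * bin_width
        counts[left_edge] = counts.get(left_edge, 0) + 1
    return dict(sorted(counts.items()))

def freedman_diaconis_width(values):
    """Bin width = 2 * IQR / n^(1/3), one common data-driven rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return 2 * (q3 - q1) / len(values) ** (1 / 3)

# Hypothetical nightly prices with a pile-up at the round value 50.
prices = [42, 44, 47, 50, 50, 50, 50, 53, 55, 58, 61, 66]
print(histogram(prices, 10))  # wide bins: the pile-up at 50 is hidden
print(histogram(prices, 2))   # narrow bins: the pile-up at 50 stands out
print(round(freedman_diaconis_width(prices), 1))
```

With the wide bins the four identical prices are merged with their neighbours, while the narrow bins isolate them, which is the "different story" effect described in the talk.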
- 37:09An alternative way to represent a
- 37:10distribution is to use a density plot.
- 37:13So a density plot is basically
- 37:15a smoothing of a histogram.
- 37:18Here you collect bins, and here
- 37:21you smooth the shape of the distribution
- 37:23so that you have a continuous function.
- 37:27This is graphically nice,
- 37:30and it allows you to compare,
- 37:31for example, the distributions of
- 37:33two variables, as you see here
- 37:35in green and in violet,
- 37:37and the advantage is that you can also see
- 37:40complex shapes of the distribution,
- 37:42for example here the bimodality,
- 37:44or here the presence of this shoulder
- 37:46of the distribution.
- 37:48The pitfall is similar to the histogram,
- 37:51so you should always be careful
- 37:53in selecting
- 37:54how much to smooth the distribution.
- 37:56And here you see an example.
- 37:58So these are the
- 37:59points, the single
- 38:02points that were used
- 38:04in order to build the distribution.
- 38:07They were randomly chosen from a normal
- 38:10distribution, and you can see that the
- 38:12problem is similar to the bin size,
- 38:14so here you have to select basically
- 38:19a wavelength in order to approximate
- 38:21the function to a curve,
- 38:23and depending on the wavelength,
- 38:26the resolution of the
- 38:27wavelength that you choose,
- 38:29the result is different.
- 38:31So you could have this kind of plot
- 38:34that seems to show a lot of local peaks,
- 38:38but by smoothing more you have
- 38:41instead the normal distribution
- 38:43from which you drew the data.
- 38:46There is a balance: problems appear
- 38:48in choosing bins that are too
- 38:51large, or here with excessive smoothing,
- 38:53because this oversimplifies
- 38:54the original distribution,
- 38:55but on the other side,
- 38:57if you take a resolution that is too small,
- 39:01too granular,
- 39:02you can obtain strange effects.
- 39:04So you could see, for example,
- 39:06peaks that depend on the
- 39:09extraction of random numbers.
- 39:11Again,
- 39:11also in this case there is software
- 39:15that, given the original data,
- 39:18your original raw data, can calculate the
- 39:22optimal smoothing wavelength in order
- 39:26to avoid distortions based on your data.
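The trade-off can be reproduced with a small Gaussian kernel density estimate in plain Python. What the talk calls the smoothing wavelength is more commonly called the bandwidth. The sketch below counts the local peaks of the estimated curve for a very small, a data-driven, and a very large bandwidth. Silverman's rule of thumb is used as one example of an automatic choice; it is an assumption, since the talk does not name a specific method, and all function names are hypothetical.

```python
import math
import random
import statistics

def gaussian_kde(points, bandwidth):
    """Return a function estimating the density as an average of Gaussian bumps."""
    norm = len(points) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm

    return density

def silverman_bandwidth(points):
    """Silverman's rule of thumb: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    q1, _, q3 = statistics.quantiles(points, n=4)
    spread = min(statistics.stdev(points), (q3 - q1) / 1.34)
    return 0.9 * spread * len(points) ** (-1 / 5)

def count_peaks(density, lo=-4.0, hi=4.0, steps=400):
    """Count local maxima of the density on an evenly spaced grid."""
    ys = [density(lo + (hi - lo) * i / steps) for i in range(steps + 1)]
    return sum(1 for i in range(1, steps) if ys[i - 1] < ys[i] > ys[i + 1])

# 30 points drawn from a standard normal distribution, as in the slide's setup.
random.seed(0)
sample = [random.gauss(0, 1) for _ in range(30)]
for h in (0.05, silverman_bandwidth(sample), 2.0):
    print(f"bandwidth {h:.2f}: {count_peaks(gaussian_kde(sample, h))} local peaks")
```

A very small bandwidth produces a separate bump around almost every point, while a very large one flattens the curve, so the number of apparent peaks depends on the smoothing and not only on the data.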
- 39:29A compact way to represent the
- 39:31distribution is the box whisker plot,
- 39:34and here you can see how a box
- 39:36whisker plot can be obtained
- 39:39from this distribution of 20 points.
- 39:41So basically the box whisker plot
- 39:43represents as a box 50% of the data
- 39:46of the distribution.
- 39:48Usually you have a central line
- 39:50that is the median.
- 39:51It's important:
- 39:52not the mean,
- 39:53in the box whisker it is always the median.
- 39:56This is the first quartile
- 39:58and the third quartile,
- 40:00so the 25th percentile of the data and the
- 40:0375th percentile of the data.
- 40:05So in the box you have 50%
- 40:07of your data, the central part:
- 40:09here, 50% of your data.
- 40:11Then you have the whiskers.
- 40:14The standard definition of the
- 40:17whisker length is that they are as
- 40:20long as the interquartile range,
- 40:22that is the distance between Q1 and Q3, times 1.5.
- 40:27And you see these as the whiskers
- 40:30of your plot.
- 40:32So these collect most of the
- 40:34distribution of your data.
- 40:36The data that are outside the whiskers
- 40:38are considered to be outliers.
- 40:40For example,
- 40:41here you see these three points.
- 40:44They are outside the whisker size,
- 40:46and so these usually are individually
- 40:49displayed in the whisker plot and are
- 40:52considered to be outliers according
- 40:54to this definition of the whiskers.
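The numbers behind a box whisker plot can be computed directly. The sketch below, in plain Python, follows the default definition given in the talk (whiskers reach at most 1.5 times the interquartile range beyond the box). The function name `box_stats`, the choice of the "inclusive" quartile method, and the example values are assumptions; different programs compute quartiles in slightly different ways.

```python
import statistics

def box_stats(values, whisker=1.5):
    """Summary statistics for a box plot using the 1.5 * IQR whisker rule."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - whisker * iqr, q3 + whisker * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Each whisker stops at the most extreme data point inside its fence,
        # so the two whiskers can have different lengths.
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        # Points beyond the fences are drawn individually as outliers.
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

# Hypothetical measurements with one extreme value.
print(box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30]))
```

In this example the upper fence is 14.5, so the upper whisker ends at 9 (the largest value inside the fence) and 30 is reported as an outlier.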
- 40:57Yes,
- 40:58if you wanted
- 40:59to make these plots, yeah,
- 41:00is there an easy way to do it
- 41:03or do you like you personally,
- 41:05just do it by in R or something?
- 41:08Well, box plots, I don't think
- 41:10you can do them with Excel,
- 41:13but for example with Prism
- 41:15or Origin you can, totally.
- 41:20I think the only limitation is
- 41:22Excel, but to be honest, I didn't
- 41:24check the last version of Excel.
- 41:27Right, for us to think about,
- 41:28you know, we have our data and there
- 41:31are many different ways of plotting it,
- 41:33but it sounds like Prism might be the
- 41:35way to go to try to do it, unless
- 41:37you're somebody like you
- 41:38who knows how to put it into R.
- 41:41Yes, probably. So Prism
- 41:45gives you an option that is much
- 41:47used. I usually used
- 41:48Origin; with respect to Prism,
- 41:50I think it has more,
- 41:52more power,
- 41:54so there are more things that you
- 41:56can do with Origin than with Prism,
- 41:59I think because it was designed
- 42:01for the physics community,
- 42:04but the tradeoff is always complexity,
- 42:06so Prism has less power,
- 42:09fewer choices, but it's easier
- 42:10to use than Origin.
- 42:13But both share the same philosophy,
- 42:15so you need to provide the data
- 42:18in a spreadsheet format, and they are
- 42:21available in the software library at.
- 42:24OK, thank you. Toma, can you say
- 42:27the name of the other, not Prism
- 42:29but the other program?
- 42:31I have a slide after where I show it.
- 42:34It's Origin. OK, thanks, yeah.
- 42:37I have a question. So,
- 42:38so my initial understanding is that
- 42:40the whisker length is representing the
- 42:4295th percentile of the data range.
- 42:45But here it says the whisker
- 42:47length is 1.5 times this IQR length.
- 42:50But if that's the case,
- 42:52why would the left side and
- 42:54the right side whiskers
- 42:57have different lengths?
- 43:02Um, so that could be, for example,
- 43:06because here you have the, so that's
- 43:09the maximal length of the whisker.
- 43:12But if the minimum of your
- 43:14data is here,
- 43:17the whisker stops. So that's why.
- 43:19So you see, here you have outliers, and
- 43:22so the whisker can extend to the maximum
- 43:25point, that is 1.5 times this measure.
- 43:27But if, before the maximal distance,
- 43:30here you meet the minimal point,
- 43:32the whisker ends
- 43:34there. So that's why.
- 43:36OK, I see. It's also true that this
- 43:39whisker definition can be customized,
- 43:41so this is the default interpretation.
- 43:43I don't know who decided this.
- 43:46I don't have the original publication,
- 43:48but you can choose whiskers to
- 43:50be defined differently, so that's why.
- 43:52Also in the Network Journal
- 43:54paper when you do a box plot,
- 43:56you always have to specify in the statistical
- 43:59methods how you designed your box plot.
- 44:02So you have to provide how,
- 44:04for example, the whiskers were defined.
- 44:07Because sometimes it's true that,
- 44:08for example,
- 44:09the whisker can represent like
- 44:1195% of the distribution.
- 44:13Right, so this is just the default,
- 44:15but it can be customized,
- 44:17so there are different choices.
- 44:21I have a question regarding the
- 44:24distribution again, maybe it's in
- 44:26continuation to what you just said.
- 44:30Some softwares allow a default value
- 44:32for the bin size and for the smoothening
- 44:36and all that say like Matlab that
- 44:38I've been trying to put this into.
- 44:41How reliable do you think that is?
- 44:44The default values and how would you suggest?
- 44:48Most of the time,
- 44:49most of the time, so I don't,
- 44:52I don't have experience with MATLAB,
- 44:53but probably it is the
- 44:56same as in R, so most of the time
- 44:58there is a sort of optimization there,
- 45:01so most of the time it is fine. But
- 45:07sometimes, especially if you
- 45:09have a distribution of data,
- 45:11but you also have a point
- 45:14with an accumulation of data,
- 45:16you could have problems in the,
- 45:20in the plot.
- 45:22But I don't have an example.
- 45:25OK, so like 95% of the time I'm OK
- 45:30with the solution that is
- 45:33provided by the MATLAB or R built-in tool.
- 45:38For example, sometimes when you compare
- 45:40two distributions with a different size,
- 45:42with a different number of points,
- 45:44that can be a problem.
- 45:48Because sometimes,
- 45:50if you're comparing, for example, a
- 45:52distribution of 10 points with
- 45:54a distribution of 1000 points,
- 45:56adopting the same wavelength
- 45:58could be a problem,
- 45:59and so you need to manually change it.
- 46:03So that's the yes,
- 46:05but that's that probably could be a.
- 46:08A practical example on when it's not ideal.
- 46:12Because the software,
- 46:13if you are trying to compare a
- 46:1610 points versus 1000 points,
- 46:18tries to define a common wavelength.
- 46:21But sometimes this leads
- 46:23to like distorted images.
- 46:25I don't have an example to show.
- 46:29That's good enough, thank you.
- 46:32And well, I can leave a note.
- 46:35Sometimes you can also see the
- 46:37notch inside your box whisker,
- 46:39so the notch is this
- 46:42feature that represents a measure
- 46:44of uncertainty for the median.
- 46:47So sometimes it is useful to have
- 46:49these, because if you are comparing
- 46:51a lot of box whisker plots you can
- 46:54look at the uncertainty as if it was a
- 46:56sort of standard error of the median.
- 46:59And so if two box whiskers overlap in
- 47:02their notches,
- 47:03it probably means that the
- 47:05medians are not statistically
- 47:08significantly different.
- 47:09This could be a way to
- 47:12use the notch to ease
- 47:15the interpretation of the data
- 47:16and the comparison of different
- 47:18distributions, and that's why the
- 47:19box whisker plots are so popular,
- 47:21because they allow you to represent
- 47:24the distribution of data in
- 47:25a very compact format.
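As a rough illustration of the notch comparison, the sketch below computes a notch interval around the median and checks whether two notches overlap. The talk does not give a formula; the expression median ± 1.57 × IQR / √n used here is a commonly used convention for notched box plots and should be read as an assumption, as should the function names and the example groups.

```python
import math
import statistics

def notch_interval(values):
    """Notch around the median: median +/- 1.57 * IQR / sqrt(n)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    half_width = 1.57 * (q3 - q1) / math.sqrt(len(values))
    return median - half_width, median + half_width

def notches_overlap(a, b):
    """True when the two notch intervals share at least one value."""
    (a_lo, a_hi), (b_lo, b_hi) = notch_interval(a), notch_interval(b)
    return a_lo <= b_hi and b_lo <= a_hi

# Hypothetical groups of 16 values each.
group_a = list(range(1, 17))   # median 8.5
group_b = list(range(3, 19))   # median 10.5, close to group_a
group_c = list(range(21, 37))  # median 28.5, far from group_a
print(notches_overlap(group_a, group_b))
print(notches_overlap(group_a, group_c))
```

Overlapping notches suggest that the medians are probably not significantly different. The check is a visual heuristic and does not replace a statistical test.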
- 47:27This is another display of the
- 47:29anatomy of a box whisker,
- 47:31but it doesn't add anything
- 47:33to what I had before.
- 47:35So here is an example where box whisker
- 47:38plots are used in order to compare
- 47:42four different distributions.
- 47:43So the advantage is that they allow
- 47:46easy comparison, so it's easy to
- 47:49compare the distributions of A, B, C and D.
- 47:52The problem they can have is that they
- 47:55hide the shape of the distribution.
- 47:58And also usually they hide the
- 48:01number of points that were used
- 48:03to build the box whisker.
- 48:05Sometimes you can code the number
- 48:08of points so the cardinality the
- 48:10size of the distribution as the
- 48:13width of the box whisker,
- 48:15but it's rarely used because it's not
- 48:18very visually beautiful, I would say.
- 48:21So one solution could be to
- 48:23overlay on the box whisker
- 48:26plot a jitter plot.
- 48:27So a jitter plot represents the single
- 48:30points that were used to build
- 48:32the box whisker plot, and they are,
- 48:34so while on the Y axis there
- 48:36are the precise values, on the X
- 48:39axis they are randomly
- 48:43placed. Let's say there are
- 48:45also methods that do not display
- 48:47these points randomly but in a
- 48:49non-random way that captures
- 48:50the shape of the distribution,
- 48:52and I think that that kind of plot
- 48:55is also present in GraphPad
- 48:58Prism. So the advantage of this
- 49:00is that you can see, for example,
- 49:03that this distribution is bimodal,
- 49:06because you see that there are
- 49:08these high densities of points, and
- 49:10the box whisker plot cannot capture
- 49:12that. You cannot see from a box
- 49:14whisker plot that a distribution is
- 49:17bimodal. And, for example, here you
- 49:19can see that this box whisker
- 49:21plot is based on
- 49:24much less data than the others.
- 49:26So a solution for this
- 49:29is to enclose the box whisker
- 49:32plot into a violin plot.
- 49:35So a violin plot representation
- 49:37like this allows you to see the
- 49:40same information as a box whisker,
- 49:43but also information on the shape
- 49:46of the distribution. Basically,
- 49:49in a violin plot
- 49:50you add a density plot that is
- 49:54parallel to the vertical axis.
- 49:58And here, by using a violin plot, you can see
- 50:00that this,
- 50:01that this distribution
- 50:03has one peak. This one is bimodal.
- 50:08And you can also add the number here,
- 50:11instead of encoding the number,
- 50:13the size of the distribution, as
- 50:15the width of the distribution.
- 50:19So this is an example
- 50:21of comparisons between different
- 50:23ways to show a distribution.
- 50:25Here you see the histogram with the
- 50:28corresponding density plot,
- 50:29the same distribution
- 50:31visualized as a box plot,
- 50:33and visualized as a violin plot that
- 50:36captures both the features of a box
- 50:39plot plus the density distribution.
- 50:42And this is for a normal distribution.
- 50:44This is for a bimodal distribution, where you
- 50:47can see that the box plot doesn't capture it,
- 50:50so the box plot can capture the fact
- 50:53that the data are not symmetrical, and
- 50:55you see, for example, that the distance
- 50:58from this end of the box
- 51:01to the median is much more than this.
- 51:04So the box whisker is good at capturing
- 51:07asymmetrical distributions, but not
- 51:09the presence of more than one peak,
- 51:11so not the complex shape of the distribution.
- 51:15And there is a website here where
- 51:17you can see a lot of
- 51:21examples where a different choice of
- 51:23visualization can lead to a different
- 51:26conclusion, as here.
- 51:29It's true also that the violin plot
- 51:32is not efficient because
- 51:35you're showing twice
- 51:37the same information,
- 51:39so this is aesthetically pleasant,
- 51:42but is not efficient because
- 51:44you're repeating basically this
- 51:46density twice, above and below,
- 51:49and so that's why,
- 51:52for efficiency,
- 51:55there are recent visualization
- 51:57strategies such as the raincloud plot.
- 52:00So the raincloud plot shows a box
- 52:03whisker plot in the middle, half a violin
- 52:06plot here, and then also the single points.
- 52:09So that's probably one of the
- 52:12most complete, exhaustive ways to
- 52:14represent a distribution of data.
- 52:16And it's called the raincloud because
- 52:18of this effect: this should be the cloud,
- 52:20and this is the rain that falls
- 52:22on the proposed below.
- 52:23So you can find information on
- 52:25how to plot these by
- 52:28following this link.
- 52:29Another yeah.
- 52:31Quick question, is there a?
- 52:35How to say the restriction or limitation
- 52:38as to how many data points are required
- 52:43for generating reliable violin plot?
- 52:50Generally not. Probably more than 10,
- 52:55I would say, because otherwise
- 52:58you can see it empirically,
- 53:00because if the data are too few you can
- 53:03see that the violin basically has a sort
- 53:06of waves around each point of your data.
- 53:10So as a general threshold
- 53:15I would say 10 points would be
- 53:19the minimum number. And asking
- 53:21that question is, of course, if
- 53:23you have a lot of data points,
- 53:26these would be informative.
- 53:27But if you have, let's say,
- 53:29less than 10 or a small number,
- 53:31this could be really distorting
- 53:33or faking the. Yeah, yeah, that's
- 53:35true. That's why I would
- 53:37say 10, because below 10
- 53:39probably the best strategy is to show
- 53:41the single points and then a summary
- 53:44such as the mean or median plus
- 53:46some variation, the standard deviation,
- 53:48but not the distribution
- 53:50as a violin plot.
- 53:51So that's for less than 10 data points.
- 53:57Alright, when the data are too many,
- 53:59for example, it doesn't make
- 54:00sense to show the single points,
- 54:03because they overlap,
- 54:04they overlap each other and so you
- 54:06don't see anything. That happens
- 54:08when you have more than 1000 points,
- 54:10and so the best solution in that case
- 54:12is, for example, to show only the violin.
- 54:18So there is a range for which
- 54:22the best solution is to show the
- 54:24single data points with the crossbar,
- 54:27so an element which captures mean or
- 54:30median plus standard deviation or
- 54:33confidence interval. There is a
- 54:35range that is in the middle, from 10
- 54:38to some hundreds, where the violin
- 54:40plot and the box whisker plot are
- 54:43the best option to visualize.
- 54:44And when you have many,
- 54:47many data, more than 1000, probably,
- 54:48if you want to capture the distribution,
- 54:51then only the violin plot, rather
- 54:53than the single points, is the best way.
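The rule of thumb in this answer can be written as a small helper. This is only a sketch of the speaker's rough guidance: the function name is hypothetical, and the exact cut-offs (in particular, treating everything from 10 up to 1000 points as the middle range) are an assumption, since the talk says "some hundreds" for the middle range and "more than 1000" for the upper one.

```python
def suggest_display(n_points):
    """Map a sample size to the kind of display suggested in the talk."""
    if n_points < 10:
        # Too few points for a density: show them all with a summary.
        return "single points plus a crossbar (mean or median with SD or CI)"
    if n_points <= 1000:
        # Middle range: summaries of the distribution work well.
        return "box whisker or violin plot, optionally with the points overlaid"
    # Very large samples: individual points would overlap each other.
    return "violin plot only"

for n in (6, 150, 20000):
    print(n, "->", suggest_display(n))
```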
- 55:01Did did he? Did it answer?
- 55:06Yeah, that was awesome.
- 55:08That is a great explanation.
- 55:11OK, another alternative
- 55:13way to maximize the efficiency of the
- 55:16violins, that I saw a lot
- 55:18with single cell data, for example,
- 55:20is the use of split violin plots,
- 55:23so you use the violin plot to show a
- 55:26comparison between two distributions.
- 55:29So you see here, this plot shows the
- 55:32representation of the age of females and
- 55:34males using different social
- 55:37media: Instagram, Facebook, Twitter.
- 55:38So it's a way to show, using
- 55:41half of a violin plot, differences
- 55:44in the distributions, and this can
- 55:46be used when you have a contrast
- 55:49of two conditions or you want to
- 55:52compare two distributions,
- 55:53also in single cell data.
- 55:57About the violin plots,
- 55:58like in such cases, yeah,
- 56:00so what determines the height of the peaks?
- 56:03Is it that everything is normalized
- 56:05so that the total area is the same,
- 56:07or the maximum height is the same?
- 56:10So most of the time,
- 56:12so you have choices usually, so you can
- 56:16choose to have the same maximum height.
- 56:20And that's usually the,
- 56:22that's usually what you find,
- 56:24so you plot them in a way
- 56:27that the range is the same, from
- 56:30here to here, from here to here.
- 56:33The alternative is to use the
- 56:35real criterion for a density,
- 56:38and that should be that the
- 56:41area under this is equal to 1,
- 56:45and so that the two have the same area.
- 56:48An alternative is to have an
- 56:50area that is proportional to
- 56:52the number of observations,
- 56:54but I think that visually most
- 56:57of the time you find that
- 57:00the criterion, in order
- 57:02to have balanced plots,
- 57:05is to have the same range,
- 57:07meaning from here to the maximum
- 57:09for all the plots,
- 57:11independently from the area and
- 57:13independently from the number of points.
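The three scaling choices mentioned in this answer can be sketched as follows. The function `scale_violin`, its `mode` names, and the two example curves are hypothetical; the input is assumed to be a density already evaluated on an evenly spaced grid and normalized to area 1.

```python
def scale_violin(density, n_obs, mode):
    """Rescale a density curve (heights on an evenly spaced grid) for drawing.

    mode="width": every violin reaches the same maximum width (1.0)
    mode="area":  the density is left as is, so each violin has area 1
    mode="count": the area is proportional to the number of observations
    """
    if mode == "width":
        peak = max(density)
        return [d / peak for d in density]
    if mode == "area":
        return list(density)
    if mode == "count":
        return [d * n_obs for d in density]
    raise ValueError(f"unknown mode: {mode}")

# Two hypothetical density curves on the same grid (step 1.0, area 1 each).
narrow = [0.0, 0.1, 0.8, 0.1, 0.0]   # built from 50 observations
wide = [0.1, 0.2, 0.4, 0.2, 0.1]     # built from 500 observations
for mode in ("width", "area", "count"):
    print(mode, max(scale_violin(narrow, 50, mode)), max(scale_violin(wide, 500, mode)))
```

With the "width" choice both violins look equally wide even though one is built from ten times more observations, which is the point the speaker makes about balanced but less informative plots.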
- 57:18It's probably not the best solution from
- 57:20the point of view of communication,
- 57:21but it's the most used. OK, thank you.
- 57:27A variation of this is also the
- 57:30use of ridgeline plots.
- 57:32They allow you to compare a
- 57:35lot of different densities.
- 57:37For example, here you see a comparison
- 57:40of the density of temperatures in
- 57:43different months in a location that, if I
- 57:45remember, is Lincoln, NE, and this
- 57:48is used in single cell,
- 57:52it is used a lot now, in these years,
- 57:54with single cell data.
- 57:55For example,
- 57:56here you see that it is used to
- 57:58compare the distribution of the
- 58:01expression of one gene, LYZ
- 58:03or CCL5, in different populations
- 58:05of cells that probably
- 58:07come from some blood sample,
- 58:10different populations, and
- 58:11these allow you to see,
- 58:13sorry, to see marker genes, or to
- 58:16see how the expression of a gene is
- 58:19specific for a population of cells.
- 58:22So that's why I included it, because I
- 58:25see that the frequency of this plot,
- 58:28especially in the single cell
- 58:31visualization field, is increasing a lot.
- 58:34I have this section of the
- 58:36presentation that we could skip.
- 58:38In general the message about this
- 58:40is that Venn diagrams are good
- 58:43when you have two sets,
- 58:45but if there are,
- 58:46if there are more, they are a bad way to
- 58:49represent intersections between sets.
- 58:51And this actually is a plot that
- 58:54was published in Nature, and it's
- 58:56about a comparison of the genome
- 58:59of the banana with other species.
- 59:01So the problem in general is that
- 59:04when you have more than two, three,
- 59:06four, but also two Venn diagrams,
- 59:09it's not the best way to
- 59:11visualize intersections with the use
- 59:13of the traditional Venn diagrams.
- 59:15So a table is probably more effective
- 59:18than this, because the areas are
- 59:21not proportional to the size,
- 59:23and it's quite confusing to see
- 59:26the specific intersections, and
- 59:28so an alternative way
- 59:30that was developed in recent years
- 59:32was the use of the concept of
- 59:35UpSet plots, so to represent the
- 59:37intersections in a matrix format.
- 59:40So you represent this as a member,
- 59:42as some object, for
- 59:44example
- 59:44a gene that is present only in
- 59:46list A, only in list D, only in list C,
- 59:49in the intersection between A and D, or a gene
- 59:51present in all the intersections.
- 59:53So you can use this matrix format
- 59:55to show the intersections, and then
- 59:58you can display the cardinality
- 01:00:00of each
- 01:00:01intersection, so the number of genes,
- 01:00:03for example that are only in the PDF
- 01:00:06error pathway that you see here.
- 01:00:08The number of genes that are in
- 01:00:11common between the EGFR and PTEN pathways,
- 01:00:14that you see here.
- 01:00:15So this is a way to show the cardinality
- 01:00:19of the global list that you see here.
- 01:00:22And also you can rank the intersections
- 01:00:24between the different sets according
- 01:00:26to their size, to their cardinality.
- 01:00:29So it's a much clearer way
- 01:00:30to show the structure of the intersections
- 01:00:34than using a Venn diagram.
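The counts shown as bars in an UpSet plot can be computed with a few lines of plain Python. The sketch below assigns every item to the exact combination of lists it belongs to and ranks the combinations by cardinality. The function name and the gene lists are hypothetical placeholders, not data from the talk.

```python
from collections import Counter

def exclusive_intersections(sets):
    """Count items by the exact combination of sets they belong to.

    `sets` maps a set name to a collection of items. The result is a list of
    (combination, cardinality) pairs, largest first; each pair corresponds to
    one column of the UpSet matrix and the bar drawn above it.
    """
    membership = {}
    for name, items in sets.items():
        for item in items:
            membership.setdefault(item, []).append(name)
    counts = Counter(tuple(sorted(names)) for names in membership.values())
    return counts.most_common()

# Hypothetical gene lists; the gene names are placeholders.
gene_lists = {
    "A": {"g1", "g2", "g3", "g4", "g5"},
    "B": {"g3", "g4", "g5", "g6"},
    "C": {"g5", "g7"},
}
for combo, size in exclusive_intersections(gene_lists):
    print(" & ".join(combo), size)
```

Because every item is counted in exactly one combination, the cardinalities add up to the number of distinct items, something that is hard to read from overlapping areas in a Venn diagram.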
- 01:00:38I skip this because they are.
- 01:00:41There were some examples of bad
- 01:00:44usage of graphic in politics.
- 01:00:47And a lot are looking at online
- 01:00:50and related to Fox News.
- 01:00:52Of bad usage of klasa display.
- 01:00:54So the final part could be how to
- 01:00:56draw this pad. There is
- 01:00:58relative they were trying to.
- 01:01:00Not make the point. Yeah,
- 01:01:02well they were trying to make him
- 01:01:04to give a message by distorting the.
- 01:01:09Yeah, this for example is an
- 01:01:11issue: whether you always need to
- 01:01:13include the zero in your plot.
- 01:01:15So, this is controversial.
- 01:01:17Let's say that, in general,
- 01:01:19in bar plots skipping the zero is a bad idea,
- 01:01:21but for example it can be a good
- 01:01:23idea in time series,
- 01:01:25and that's because in bar plots
- 01:01:27the height of the bar is
- 01:01:30the main message of the figure,
- 01:01:32while, for example, here in a
- 01:01:34time series the main message
- 01:01:36is how the two trajectories
- 01:01:38evolve and are interconnected.
- 01:01:40So the main issue is the horizontal
- 01:01:43axis, and so you can skip the zero.
- 01:01:47So again,
- 01:01:47it depends on how much this inclusion
- 01:01:50or exclusion of the zero distorts
- 01:01:53the main message of the figure.
- 01:01:56So how to draw plots?
- 01:01:58So here there is an outline
- 01:02:00of the software that you have.
- 01:02:03So this is some commercial
- 01:02:05software, starting from Excel.
- 01:02:07It's probably the most used or available,
- 01:02:10but it doesn't allow you to plot all the
- 01:02:13solutions that I showed before. But,
- 01:02:16for example,
- 01:02:17GraphPad,
- 01:02:18GraphPad Prism,
- 01:02:18or Origin Pro are two software packages
- 01:02:21that are available, and with those
- 01:02:23you should be able, in an
- 01:02:26environment that is similar to Excel,
- 01:02:28to produce most of the plots that
- 01:02:30you saw in the presentation today.
- 01:02:33So this is commercial software that
- 01:02:35doesn't require programming
- 01:02:36skills. On this side
- 01:02:38you see the main solutions
- 01:02:40that are used by data scientists,
- 01:02:42but that require programming
- 01:02:44skills, so the two most
- 01:02:46common languages in data science,
- 01:02:48R and Python. So for R you have the ggplot
- 01:02:53library; for Python
- 01:02:54you have matplotlib or seaborn.
- 01:02:58These require programming,
- 01:02:59but I would say that the advantage
- 01:03:02nowadays of using these is that you can
- 01:03:05find a lot, really a lot, of examples,
- 01:03:09because there are a lot of websites
- 01:03:12where you can choose
- 01:03:15your data visualization
- 01:03:17type and you already see the code
- 01:03:20that you can use in order
- 01:03:22to produce the plot.
- 01:03:23So I would say that you just need
- 01:03:26to know how to insert, or how to
- 01:03:28load, in this programming
- 01:03:31environment a table of data.
- 01:03:33And then most of the difficulties
- 01:03:35are probably in fixing details,
- 01:03:37so it's very easy to produce the plot,
- 01:03:40different plots.
- 01:03:41It's more complicated to adapt the
- 01:03:44small things to your taste.
- 01:03:48But so,
- 01:03:48the suggestion is that if you
- 01:03:50do a lot of visualization,
- 01:03:52it's worth investing in this.
- 01:03:57Here you see maybe a future perspective,
- 01:04:01that could be the online solutions.
- 01:04:04They're already available, so with some,
- 01:04:06for example, you can produce UpSet plots,
- 01:04:09or a raincloud plot,
- 01:04:10or some other, like, exotic type
- 01:04:13of data visualization online.
- 01:04:15So there are websites,
- 01:04:17web servers where you can insert
- 01:04:20your data as tables and they produce
- 01:04:23the plot that you want, and you have
- 01:04:27some sort of interactivity,
- 01:04:28so that could be the future,
- 01:04:32where web servers provide you with
- 01:04:34the main programming environment.
- 01:04:36You just need to insert the data
- 01:04:38and you can see, interactively,
- 01:04:44by the interaction with the web server,
- 01:04:46how to customize the plot.
- 01:04:48Most of the solutions right now are
- 01:04:53commercial, and so you need to pay,
- 01:04:56and that's the drawback of this.
- 01:04:58But it could probably be the
- 01:05:00future of matching programming
- 01:05:02with ease of use.
- 01:05:06This is a useful resource that
- 01:05:08you can also use to decide which
- 01:05:11kind of plot you want.
- 01:05:13So there are a lot of these trees, so that,
- 01:05:16depending on what you want to represent,
- 01:05:19one numeric variable, two numeric
- 01:05:21variables, or categorical variables,
- 01:05:22you can follow the tree and arrive
- 01:05:24at the best graphical
- 01:05:26solutions to display your data.
- 01:05:28So I suggest you visit it,
- 01:05:30also to look at what are the
- 01:05:33kinds of possibilities for data
- 01:05:35representation that you have. Online
- 01:05:37there are many of these sites
- 01:05:39now, and that's why it's easy to
- 01:05:42look at the documentation and also
- 01:05:44to retrieve and reproduce the code.
- 01:05:46This is another example. I close with this
- 01:05:52quotation, that I find particularly
- 01:05:54related to data visualization:
- 01:05:57science is not nature itself, but
- 01:06:00it's nature under our observation.
- 01:06:03And so the science of data visualization
- 01:06:05is a way to allow more adherence between
- 01:06:10observation, science, and visualization.
- 01:06:17Thank you come on.