
Fantastic plots and how to draw them

October 29, 2020

  • 00:00The topic of today's talk
  • 00:04is data visualization.
  • 00:05It will be very general,
  • 00:09on how to design some strategies,
  • 00:12some issues, some principles to guide
  • 00:15the visualization of our data.
  • 00:20So data visualization is important to
  • 00:22explore the data and this is particularly
  • 00:25crucial since nowadays data are becoming
  • 00:28much more complex and much bigger,
  • 00:31and so in general there
  • 00:33is a rise of data science.
  • 00:35So not only in research,
  • 00:37not only in biological research.
  • 00:40The second function of data
  • 00:43visualization
  • 00:45is to communicate the data, and
  • 00:47that may be the most traditional one.
  • 00:50So that's what
  • 00:54creating publication figures,
  • 00:55for example,
  • 00:56is about: communicating data to others,
  • 01:01because communicating data visually is
  • 01:04more efficient than words in general.
  • 01:09So in order to represent complex data, here
  • 01:13I collected three
  • 01:16general challenges and aims.
  • 01:18Whenever you plot the data, it is
  • 01:22important that the plots and
  • 01:24the representations are precise,
  • 01:27so they're truthful.
  • 01:28That means that distortion has
  • 01:31to be avoided as much as possible.
  • 01:34That is not always achievable,
  • 01:36so distortion sometimes is unavoidable.
  • 01:39Think, for example, about
  • 01:41when you plot 2D maps
  • 01:44to represent 3D data.
  • 01:46But the point is that the
  • 01:48distortion must not convey
  • 01:49the message of the figure,
  • 01:51so it has to be something that is not
  • 01:54related to the main message of the figure.
  • 01:57Otherwise it's a problem.
  • 01:58Then the second point is clarity.
  • 02:01So the figure must not be ambiguous,
  • 02:06and the third one is efficiency.
  • 02:09So every
  • 02:11bit of ink, every pixel is precious,
  • 02:14so each decision in making your plot,
  • 02:17each decision on the color, on the size,
  • 02:20on the number of layers
  • 02:24that you plot is important.
  • 02:28Everything has to have a purpose,
  • 02:32so you should reduce the
  • 02:34so-called chartjunk.
  • 02:36Below in the slide you see
  • 02:38a quotation from Edward
  • 02:39Tufte. I discovered, by the way,
  • 02:43only yesterday,
  • 02:45and I've been here
  • 02:48for three years and
  • 02:50never knew it, that Edward Tufte,
  • 02:52who is one of the
  • 02:56most celebrated visualization
  • 02:58experts, is there in New Haven.
  • 03:03And the quotation is that with an
  • 03:05image you have to give to the viewer
  • 03:07the greatest number of ideas in the
  • 03:09shortest time and with the least possible
  • 03:11ink.
  • 03:13This is another general representation
  • 03:15of what you should consider
  • 03:18to make a good visualization.
  • 03:20Also, this is very general, so
  • 03:23it's not only about science,
  • 03:25and basically the criteria are
  • 03:27organized in four different sets.
  • 03:29So you need to represent the information,
  • 03:32but the figure also needs
  • 03:34to convey, to communicate a story, and
  • 03:37that's the concept of the figure.
  • 03:40This is connected also with the goal
  • 03:43of the figure, so the function,
  • 03:47so what is the message that you
  • 03:50want to display. And also the visual
  • 03:53form is important.
  • 03:55Obviously the weight of these four
  • 03:58different layers is different in
  • 04:00different applications of images,
  • 04:02so visual form is probably more
  • 04:05important for artistic displays,
  • 04:07while for scientific displays
  • 04:09probably information, goal and
  • 04:12story are more important.
  • 04:14This doesn't mean that you should not
  • 04:17consider also the visual part.
  • 04:20Ideally the perfect visualization
  • 04:22is at the center of these four sets.
  • 04:28So this is for the introduction. Now
  • 04:31the rest of the presentation will be
  • 04:33structured with some very concrete examples,
  • 04:35and it's also organized in
  • 04:37a way that is interactive,
  • 04:39so I will show
  • 04:41some examples of figures and I will
  • 04:44ask you what could be wrong with each figure.
  • 04:48Starting with this one:
  • 04:49this is a figure that is very
  • 04:52frequent in scientific publications.
  • 04:55It's a bar plot, and
  • 04:58it's actually the most frequent kind of image
  • 05:01that you can find in biomedical journals.
  • 05:08Do you have any ideas of what
  • 05:11could be wrong with this?
  • 05:13Not pretty. Yes. Lots of it.
  • 05:18It's lacking data.
  • 05:21It's not showing you
  • 05:23the data distribution.
  • 05:24Yeah, yes, exactly.
  • 05:26It's putting the treatment on the left,
  • 05:29which I never like.
  • 05:31I always want the control on the
  • 05:33left. Oh yes, OK.
  • 05:35Yes, that's true, yes,
  • 05:36so it has a lot of
  • 05:42visual problems, but
  • 05:44the main thing is that it doesn't
  • 05:47show the data.
  • 05:49That's the main drawback of this image,
  • 05:52and particularly in the last years
  • 05:54that's the trend, that a lot of
  • 05:57publishers are requesting in images.
  • 05:59So the principle is that you need, ideally, to
  • 06:02always show the data points in every figure,
  • 06:05because you should show the data that
  • 06:08make up your figures, and this for a
  • 06:11bar plot means that you have to show
  • 06:13the data points.
  • 06:16So here you see an example of how this
  • 06:19bar plot can be represented with the
  • 06:22data points. Here
  • 06:25on the right you see the single
  • 06:28data points and also a summary
  • 06:30statistic that could be, for example,
  • 06:33the mean plus or minus the standard deviation
  • 06:35for the treatment and the control.
  • 06:38In general, showing bar plots
  • 06:40with only the mean and the standard
  • 06:42deviation is a problem, and there was
  • 06:45a publication five years ago
  • 06:47that described why. One issue,
  • 06:49for example,
  • 06:50is that different data distributions
  • 06:52can lead to the same bar plot.
  • 06:55You see an example here.
  • 06:57So in A
  • 06:59you see a bar plot representation
  • 07:01of a distribution of data, and all the
  • 07:04distributions that you see from B to E
  • 07:06could be represented
  • 07:08by that bar plot.
  • 07:09So the ideal situation would be
  • 07:11what you see here in plot
  • 07:14B, where you have data that are
  • 07:18symmetrically distributed.
  • 07:19So if the distribution
  • 07:21of your real data is
  • 07:23like this, the bar plot is less problematic.
  • 07:26But for example in C you see a situation
  • 07:28where you have an outlier, and so
  • 07:31for example this would mean that the
  • 07:33supposed difference that you are
  • 07:35showing in the bar plot is not real,
  • 07:38but it's present only because
  • 07:39you have this outlier point,
  • 07:41while most of the other data are
  • 07:44overlapping in the two distributions.
  • 07:46Sometimes, as you see in D,
  • 07:49this plot could hide some patterns in
  • 07:52the data, so that's what you see here.
  • 07:55Do you see my cursor?
  • 07:59Yes, OK,
  • 08:00so this for example shows that
  • 08:05the distributions that you see
  • 08:07here are bimodal.
  • 08:08This could be linked, for example,
  • 08:11to replicates, for example
  • 08:12technical replicates and
  • 08:13biological replicates,
  • 08:14or it could be an important
  • 08:16property of the data.
  • 08:18Nevertheless,
  • 08:18it's something that you cannot
  • 08:20see if you represent the data with
  • 08:23a bar plot. Bar plots also
  • 08:25hide the number of data points that are
  • 08:27used to draw
  • 08:29the bars themselves, and so
  • 08:31for example in E you see a situation
  • 08:33where you have an unequal number of
  • 08:35points for the black and the white
  • 08:39bars on the left and the right.
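As a small numeric illustration of the point that different distributions can produce the same bar plot, the sketch below builds three hypothetical samples (not the data from the slide) that share the same mean, standard deviation and sample size. A mean plus or minus SD bar would look identical for all three, even though one is symmetric, one is driven by an outlier and one is bimodal.

```python
import math
import statistics

# Hypothetical samples, constructed so that each has mean 5 and the
# same sample standard deviation (sum of squared deviations = 20).
x = math.sqrt(5 / 3)
r = math.sqrt(5)
samples = {
    "symmetric": [2, 4, 6, 8],
    "outlier": [5 - x, 5 - x, 5 - x, 5 + 3 * x],
    "bimodal": [5 - r, 5 - r, 5 + r, 5 + r],
}


def bar_summary(values):
    """Return what a mean +/- SD bar plot would encode: mean, SD and n."""
    return statistics.mean(values), statistics.stdev(values), len(values)


summaries = {name: bar_summary(values) for name, values in samples.items()}
for name, (mean, sd, n) in summaries.items():
    print(f"{name:10s} mean={mean:.3f} sd={sd:.3f} n={n}")
```

Only plotting the individual points would reveal that the three samples are different.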
  • 08:42This is a problem also when you
  • 08:45want to show paired data in bar plots.
  • 08:49So again here you see a situation
  • 08:53where the bar plot
  • 08:54is the same
  • 08:57for the situations that you see
  • 08:59displayed in B, C and D. B, C and D are
  • 09:01very different situations. Here
  • 09:03you could imagine, for example,
  • 09:04that these data are obtained from single
  • 09:06patients treated with a drug,
  • 09:09and you measure a parameter of the patients,
  • 09:11and so the information related to each
  • 09:14patient has to be connected. That's the
  • 09:17meaning of the paired plot.
  • 09:20The situation in B shows that the
  • 09:22drug has a consistent effect on all
  • 09:25the patients, and you can see that
  • 09:28calculating for each patient
  • 09:30the difference between the dots on
  • 09:32the left and on the right gives rise
  • 09:35to this plot here below,
  • 09:38where all the differences are
  • 09:40positive and are also consistent. In
  • 09:42C you see a situation where
  • 09:44the drug has
  • 09:46very different effects depending
  • 09:48on the patient,
  • 09:50so that the distribution of
  • 09:52the differences is skewed.
  • 09:53And by the way,
  • 09:55this line represents the median
  • 09:57difference that you see across
  • 09:59patients for the treatment.
  • 10:01And the third plot, D,
  • 10:03has a composition of effects.
  • 10:05So here you see that, again,
  • 10:08the difference is bimodal.
  • 10:09That means that there are patients
  • 10:11that do not respond to the drug,
  • 10:14as you see here with the
  • 10:15horizontal lines, and some patients
  • 10:17that responded to the drug.
  • 10:19So the resulting distribution of the
  • 10:21differences, as you see here, is bimodal.
  • 10:24The problem with bar plots,
  • 10:25and the problem
  • 10:26also
  • 10:27if you use bar plots with paired
  • 10:29data, is that you don't see
  • 10:31any of this structure, so
  • 10:33you are losing it.
  • 10:34So the best way is always to show
  • 10:36the dots of your distribution,
  • 10:38maybe together with the bar plots,
  • 10:40and if the data are paired, also
  • 10:42to show the single connections
  • 10:44in between the dots.
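The paired-data argument can be sketched numerically. The values below are hypothetical, not taken from the slide: two treatments give the same group mean after treatment, so the bars would have the same height, but the per-patient differences tell two different stories.

```python
import statistics

# Hypothetical measurements for five patients (same patient order in each list).
before = [10, 12, 14, 16, 18]
consistent_after = [12, 14, 16, 18, 20]  # every patient changes by +2
mixed_after = [10, 12, 14, 20, 24]       # three non-responders, two responders


def paired_differences(pre, post):
    """Per-patient change, which is what the connecting lines encode."""
    return [b - a for a, b in zip(pre, post)]


diff_consistent = paired_differences(before, consistent_after)
diff_mixed = paired_differences(before, mixed_after)

print("group means:", statistics.mean(consistent_after), statistics.mean(mixed_after))
print("consistent differences:", diff_consistent, "median:", statistics.median(diff_consistent))
print("mixed differences:     ", diff_mixed, "median:", statistics.median(diff_mixed))
```

Both treated groups have the same mean, yet the median per-patient difference is 2 in one case and 0 in the other, which is exactly the structure that is lost when the pairing is not shown.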
  • 10:49There is also an issue about the
  • 10:52choice of displaying the mean of your data,
  • 10:56for example, versus the median,
  • 10:58or showing the standard deviation
  • 11:01versus the standard error of the mean.
  • 11:04So mean and median are ways to
  • 11:07summarize the centrality of a distribution.
  • 11:11The mean is preferable if you
  • 11:14suppose your data are, for example,
  • 11:17symmetrically distributed,
  • 11:18for example if you assume that the data
  • 11:21have a normal or Gaussian distribution,
  • 11:23while the median
  • 11:25is the point that represents
  • 11:28the middle of your data,
  • 11:30the middle of your distribution,
  • 11:32and it's more generally applicable,
  • 11:34independently of the shape
  • 11:36of the distribution of the data.
  • 11:38So here you see an example where you
  • 11:41have four different sample populations
  • 11:44and you plot the mean plus the standard deviation.
  • 11:48And that's the most conventional
  • 11:50way that you see in publications,
  • 11:52the mean plus standard deviation.
  • 11:55And for the median of the population, here you
  • 11:58see the single points, and the
  • 12:00horizontal bar represents the median.
  • 12:02So an important point about
  • 12:06mean versus median is that the
  • 12:08mean can be used only with
  • 12:11symmetrical distributions.
  • 12:13Otherwise it can be misleading,
  • 12:15while the median is more
  • 12:17generally appropriate.
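A minimal sketch of why the median is the safer summary when an outlier is present, using made-up numbers rather than the groups on the slide: adding one extreme value moves the mean a lot and the median only slightly.

```python
import statistics

# Hypothetical measurements; the last value of `with_outlier` is an outlier.
group = [4.8, 5.0, 5.1, 5.2, 5.4]
with_outlier = group + [25.0]

mean_shift = statistics.mean(with_outlier) - statistics.mean(group)
median_shift = statistics.median(with_outlier) - statistics.median(group)

print(f"mean:   {statistics.mean(group):.2f} -> {statistics.mean(with_outlier):.2f}")
print(f"median: {statistics.median(group):.2f} -> {statistics.median(with_outlier):.2f}")
print(f"shift of mean = {mean_shift:.2f}, shift of median = {median_shift:.2f}")
```

Making the outlier ten times larger would shift the mean further but leave the median where it is, since the median depends on the rank of the points and not on their magnitude.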
  • 12:19When you have an outlier like that,
  • 12:21would you always recommend the median?
  • 12:24When you have an outlier
  • 12:27like in the third group there,
  • 12:29yeah, then it makes more sense to
  • 12:31use the median.
  • 12:33Yeah, what I showed there
  • 12:34is that the median is more robust
  • 12:37with outliers, it is totally
  • 12:38more robust with outliers,
  • 12:40and the mean is not,
  • 12:42so the presence of an outlier, as
  • 12:44you see here in C, can shift the
  • 12:48mean a lot, while the median is affected,
  • 12:50but not so much,
  • 12:53especially by the magnitude
  • 12:55of the outlier,
  • 12:56I would say.
  • 12:58So, a question, right?
  • 12:59So also, you know,
  • 13:02knowing there's a difference
  • 13:03between mean and median,
  • 13:05one of the things I heard,
  • 13:07of course I haven't looked
  • 13:09into this deeply enough myself,
  • 13:11is that for the median, the distribution,
  • 13:13unlike for the mean, does not necessarily follow
  • 13:15a Gaussian or normal distribution,
  • 13:17so that from a statistical point of view
  • 13:20it is going to be a little hard to calculate
  • 13:23certain significance, etc.,
  • 13:24based on median data.
  • 13:26Is that true?
  • 13:27Or is it simply a misnomer?
  • 13:30How to calculate the
  • 13:32significance of differences,
  • 13:34that's a different question.
  • 13:36So that's a difference of approach.
  • 13:39If you choose parametric tests,
  • 13:41such as the t-test or the ANOVA,
  • 13:44those tests assume that the
  • 13:47distribution is Gaussian, is normal.
  • 13:49Yeah, so you need to be careful.
  • 13:52Usually, if it is a repeated measure,
  • 13:55so if you're testing a repeated
  • 13:56measure, you assume the error is
  • 13:59distributed in a Gaussian way,
  • 14:01but that is not always the case,
  • 14:03for example
  • 14:04if you're comparing two populations of
  • 14:07genes with a signal for each gene.
  • 14:09You just have to check it.
  • 14:11Well, so this is something that I
  • 14:13think will be particularly important
  • 14:15for experimental scientists, right?
  • 14:17Because, you know, as experimentalists,
  • 14:18when we are trained, we know, OK,
  • 14:21when we design an experiment,
  • 14:22we do several replicates so we can
  • 14:24draw an error bar, without thinking
  • 14:26why and how to deal with it.
  • 14:28And if you go to a statistician,
  • 14:30they will tell you, oh look,
  • 14:32if you're going to use the t-test,
  • 14:35you have to show me first that this is
  • 14:37actually largely a normal distribution
  • 14:39before you can actually use the t-test.
  • 14:42Whereas for the vast majority
  • 14:43of people in the lab,
  • 14:45that's not how they will
  • 14:46think about it in the first place,
  • 14:49and they are also not trained enough
  • 14:51to think, you know, how to prove
  • 14:53or disprove that that's the case.
  • 14:55So what would you suggest,
  • 14:57especially when we're doing
  • 14:58experiments where you cannot do
  • 15:00200 replicates for each experiment?
  • 15:02So what would be a good
  • 15:04approach in that regard?
  • 15:06Yeah, so there is a tradeoff between
  • 15:09the ideal situation, where the ideal
  • 15:11situation would be always to have
  • 15:14enough data points so that you can
  • 15:16understand the shape of the distribution,
  • 15:19and the real-case scenario, where you
  • 15:22can do only as many replicates as you can,
  • 15:24and so usually you have to assume
  • 15:27that the distribution is normal.
  • 15:33Ideally, you should always check.
  • 15:36And again, if you are repeating measures
  • 15:39and you are collecting a measure
  • 15:41of the same data in a repeated
  • 15:44way, you can assume that
  • 15:46if the error is stochastic,
  • 15:48it should be normally distributed.
  • 15:50So you assume that the distribution of
  • 15:52the error is Gaussian, and it makes sense.
  • 15:55But for example in other situations,
  • 15:57where you have a lot of measurements,
  • 16:00and measurements of different entities,
  • 16:02for example the expression of
  • 16:04different genes, where you are doing like
  • 16:07a compilation of genes,
  • 16:08then this assumption is less probable,
  • 16:11is less likely,
  • 16:12and you should have enough data points
  • 16:15so that you can switch from parametric
  • 16:18tests to nonparametric ones,
  • 16:23which don't make assumptions
  • 16:25on the underlying distribution.
  • 16:26These are, for example,
  • 16:27the Wilcoxon test or the
  • 16:30Mann-Whitney test.
  • 16:31And the problem is that you need more
  • 16:34replicates, because if n,
  • 16:36the sample size, is less than five,
  • 16:39you cannot reach statistical
  • 16:43significance as it is accepted, below 0.05,
  • 16:47but it's usually the more correct way.
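The remark about small sample sizes can be made concrete with a short calculation. For exact rank-based tests with no ties, the smallest two-sided p-value that can ever be obtained depends only on the sample size: it is 2 divided by the number of equally likely rank arrangements. The sketch below computes this floor; the function names are made up for illustration, and the exact sample size at which 0.05 becomes reachable depends on which test and design is used.

```python
from math import comb


def min_p_mann_whitney(n1, n2):
    """Smallest exact two-sided p-value for two independent groups (no ties)."""
    return 2 / comb(n1 + n2, n1)


def min_p_signed_rank(n_pairs):
    """Smallest exact two-sided p-value for the paired signed-rank test (no ties, no zero differences)."""
    return 2 / 2 ** n_pairs


for n in range(3, 7):
    print(
        f"n={n}: Mann-Whitney ({n} vs {n}) >= {min_p_mann_whitney(n, n):.4f}, "
        f"signed-rank ({n} pairs) >= {min_p_signed_rank(n):.4f}"
    )
```

With three replicates per group the Mann-Whitney test cannot go below 0.1, and the paired signed-rank test needs at least six pairs to get below 0.05, however large the effect is.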
  • 16:53Then the standard is not to use that,
  • 16:55and so I remember there was a case
  • 16:57where the paper was in review.
  • 16:59It was from
  • 17:02Young bean, and I remember we performed
  • 17:05the Wilcoxon test and the reviewers
  • 17:07asked why we didn't do the parametric
  • 17:10test, so they asked for the opposite.
  • 17:13They asked us to go against the
  • 17:15ideal situation.
  • 17:18I think this is very helpful.
  • 17:20I think it's really,
  • 17:21you know, telling for me,
  • 17:22especially for people who
  • 17:24are not familiar with the
  • 17:26t-test and also the Wilcoxon test.
  • 17:29I would really suggest
  • 17:30you look into that.
  • 17:32These things can be very helpful.
  • 17:33Yeah, and obviously
  • 17:35it's important when you plan,
  • 17:37if you can, to have enough data
  • 17:40points to perform a nonparametric test.
  • 17:43In high-throughput
  • 17:44experiments that we see now,
  • 17:45for example single cell, that's
  • 17:47not a problem anymore, because
  • 17:49you usually have a lot of data
  • 17:51points, and so that's less of a
  • 17:53problem. Sometimes it
  • 17:55resolves itself, because
  • 17:58the number of data points is increasing.
  • 18:00And another,
  • 18:01more generic comment. These people,
  • 18:04a lot of our group, are blood,
  • 18:10hematology researchers. Yeah,
  • 18:11and neither Blood nor Blood Advances requires
  • 18:16the investigators in their papers
  • 18:17to show all the
  • 18:19data points. And now
  • 18:21I'm on the publication committee.
  • 18:22We've actually talked about this,
  • 18:24but we go by the Journal of Cell Bio
  • 18:28instructions to authors and prep for figures,
  • 18:30and that is a Rockefeller Press
  • 18:32publication. And they
  • 18:34haven't. So they have Genome Research
  • 18:36and J Cell Bio, J Exp Med and stuff,
  • 18:38so they haven't come around to making
  • 18:41people show all their dots etc.
  • 18:43But a number of journals, as you know,
  • 18:46have, like JCI, you know, JCI
  • 18:50advances etc. They might not
  • 18:51even review your paper if
  • 18:53you show, for instance,
  • 18:54your plots on the left here.
  • 18:56Well, they might, you know,
  • 18:57it might not even go out
  • 18:59for review. The pre-reviewers will say,
  • 19:01you know, your figures are inadequate
  • 19:03for our instructions to authors, etc., etc.
  • 19:05So I think some journals are
  • 19:07coming around to: this is the way
  • 19:09we really want to see the data.
  • 19:12Yeah, I think there is a shift
  • 19:15in the paradigm, let's say,
  • 19:16and it will take years.
  • 19:18But for example, I have a slide here,
  • 19:21so this is from my experience,
  • 19:24where for example
  • 19:25all the family of the Nature journals
  • 19:28already have these policies for the figures.
  • 19:31So this is something I received after
  • 19:34the review of a paper as editorial
  • 19:38guidelines, and because of these
  • 19:41policies I had to change a
  • 19:44lot of figures.
  • 19:47One of the
  • 19:49policies, as you see here,
  • 19:50the last one, is that for sample
  • 19:52sizes that are less than 10
  • 19:54they want you to plot
  • 19:57the individual data points, and
  • 19:58so they don't accept
  • 20:00bar graphs anymore.
  • 20:05if you have some statistics such as
  • 20:08error bars with the lesson 3 replicates,
  • 20:11you have to remove,
  • 20:13remove them and you have to show
  • 20:15to show the data without the
  • 20:18statistics without the error.
  • 20:20Then this also is a point that
  • 20:23you usually is not satisfied.
  • 20:25So when you plot some statistical
  • 20:28significance values,
  • 20:29they don't accept anymore,
  • 20:31they start the stars.
  • 20:33But you have to provide the
  • 20:35precise P value in the figure.
  • 20:38It means that you have some stars.
  • 20:40You have to change the stars that
  • 20:42converting start to the precise P
  • 20:45value before before publishing and
  • 20:47then also you have to provide the
  • 20:49precise number size for each of your bars.
  • 20:52For example,
  • 20:53I mean I,
  • 20:53I think in the past it was enough
  • 20:56to provide a range like from
  • 20:58three to six replicates,
  • 21:00but now they really want the number for each.
  • 21:04For each app and population,
  • 21:05for each sample that you have.
  • 21:08So these, in
  • 21:10my experience, were something
  • 21:12that I had to provide,
  • 21:15but after the review, so it was
  • 21:18at the editorial
  • 21:22stage of acceptance of the paper,
  • 21:24and I think this is true now for all
  • 21:28the family of the Nature
  • 21:32journals.
  • 21:34Can I add something?
  • 21:36Although this is only for
  • 21:38the goal of publication,
  • 21:39it's important that we start
  • 21:42practicing all these rules in
  • 21:44our daily life, because it's so
  • 21:46painful that you have to do this
  • 21:49when you're trying to get
  • 21:51the figures into the journal.
  • 21:53It's a lot easier to do it while you're
  • 21:56making the figures in real life.
  • 21:59Yeah, so obviously it
  • 22:01saves work to do it before.
  • 22:02Yeah, it saves work, because otherwise
  • 22:05you have to redo all the figures. So
  • 22:11and also just want to say that you know
  • 22:14I used to just use Excel to placings.
  • 22:17But since my many of my lab members
  • 22:19start to use Graphpad prism to plot,
  • 22:22that makes a huge difference in
  • 22:24converting between different types
  • 22:25of parts such as this kind of things.
  • 22:28If you had a bar bar graph,
  • 22:30Indiana in that software,
  • 22:31then you can very easily change that to a
  • 22:34bar graph with different dots distributed.
  • 22:36So it's very easy to work with.
  • 22:40Yeah, on that I also have something
  • 22:42at the end of the presentation.
  • 22:44So basically there are a lot of tools now,
  • 22:46more or less commercial, but
  • 22:49they are readily available,
  • 22:51and they support many different formats
  • 22:55starting with the same initial data,
  • 22:58basically formatted as a table,
  • 23:01so that from the same table you can switch
  • 23:03to many different visualizations.
  • 23:06So that's true,
  • 23:08and it's probably also easier to make these
  • 23:11plots with single dots than it was in the past.
  • 23:15Without respect.
  • 23:18OK, so that was the main point of this part.
  • 23:22I had a part on the standard
  • 23:24deviation versus the standard error.
  • 23:26That's another issue, because the
  • 23:28standard error is basically the
  • 23:29standard deviation divided by the square
  • 23:32root of the number of experiments,
  • 23:34and so usually the standard
  • 23:36error is displayed.
  • 23:37But you have to be careful that it's
  • 23:40a measure that tends to go to zero
  • 23:43just because you increase the number
  • 23:45of replicates or the number of points.
  • 23:48So you see an example here where
  • 23:50it seems, by plotting the standard
  • 23:52error, that the black bar and the
  • 23:55white bar have the same
  • 23:58spread of the data.
  • 23:59But if you look at the standard
  • 24:02deviation you see that this is
  • 24:04an effect of that factor.
  • 24:06The black bar has higher spread,
  • 24:08but also more points,
  • 24:10and that's why the standard
  • 24:12error seems the same.
  • 24:16So that's another issue.
  • 24:18So obviously for publication
  • 24:20the standard error of the mean is preferred,
  • 24:23because it usually gives an impression
  • 24:26of the data being less dispersed.
  • 24:30But especially with different
  • 24:31numbers of samples
  • 24:33in different bars,
  • 24:35this could be misleading.
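The relation described here is SEM = SD / sqrt(n). The sketch below uses two hypothetical groups (standing in for the white and black bars, not the actual slide data): the second group has twice the standard deviation but four times as many points, so the two standard errors come out identical.

```python
import math
import statistics

# Hypothetical groups with the same mean (10) but different spread and size.
white = [8, 10, 12, 10]                               # n = 4
black = [6] * 4 + [8] * 4 + [12] * 4 + [14] * 4       # n = 16, wider spread


def sem(values):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


for name, values in (("white", white), ("black", black)):
    print(
        f"{name}: n={len(values):2d} "
        f"sd={statistics.stdev(values):.3f} sem={sem(values):.3f}"
    )
```

Error bars drawn from the SEM would look the same for both groups, while error bars drawn from the SD would show that the second group is twice as spread out.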
  • 24:40And all these issues were presented
  • 24:42in this paper published
  • 24:44five years ago in PLOS Biology.
  • 24:49I will skip this,
  • 24:50just because we will touch on this later,
  • 24:52but an alternative solution, if
  • 24:54you have enough data points,
  • 24:56so I would say more than 10,
  • 24:59an alternative solution, instead
  • 25:01of showing bar plots,
  • 25:03to show the distribution of
  • 25:05the data is the box-and-whisker plot,
  • 25:07as you see here. I have some slides
  • 25:11later with more details on this.
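To make explicit what a box-and-whisker plot summarizes, here is a small sketch that computes the quartiles and flags outliers for a hypothetical sample. It assumes the common Tukey convention, in which whiskers extend to the most extreme points within 1.5 times the interquartile range of the box; plotting tools may use other conventions.

```python
import statistics

# Hypothetical sample with more than 10 points, one of them extreme.
data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 30]

q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Tukey fences (an assumed convention): points beyond them are drawn as outliers.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
inside = [v for v in data if low_fence <= v <= high_fence]
outliers = [v for v in data if v < low_fence or v > high_fence]

print(f"box: Q1={q1}, median={median}, Q3={q3}")
print(f"whiskers: {min(inside)} to {max(inside)}")
print(f"outliers: {outliers}")
```

The box shows the middle half of the data, and the extreme point is drawn separately instead of stretching the summary.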
  • 25:14OK, so this is the next example.
  • 25:17It's a pie plot with the
  • 25:19usage of the different browsers,
  • 25:22so this is an image from outside science.
  • 25:26So this is a classic example
  • 25:29in visualization lessons.
  • 25:31So what could be wrong with this?
  • 25:39There's no n. Yeah, so that's a yes,
  • 25:44so there is no n, so
  • 25:47you don't know how many
  • 25:50data points were used in
  • 25:52order to build the frequencies.
  • 25:55Obviously pie charts are used to display
  • 25:59frequencies and proportions of some
  • 26:02classes that sum up to 100 or to one.
  • 26:05The main problem is that,
  • 26:07so the idea is that
  • 26:10you should avoid pie charts.
  • 26:12So displaying
  • 26:15information on a proportion or on
  • 26:17a percentage as a pie chart is
  • 26:21not the best choice,
  • 26:23because it was shown that humans
  • 26:27are very bad at reading angles,
  • 26:30so we're not very
  • 26:32precise in understanding differences between
  • 26:35angles and so between the sizes of the
  • 26:39slices of the pie, and so usually if you
  • 26:44convert the pie chart into a bar plot,
  • 26:47the information is much more clear.
  • 26:49It's true that the pie
  • 26:51chart is more aesthetically
  • 26:53appealing, but the bar plot
  • 26:55is, in almost any circumstance,
  • 26:57more effective in displaying,
  • 26:59for example, differences in the
  • 27:02usage of these browsers.
  • 27:04So this has been a long-standing issue, and
  • 27:07in many presentations there is always
  • 27:10this suggestion to avoid pie charts altogether.
  • 27:13There are also some examples of this.
  • 27:15So these are three pie charts, and you can see
  • 27:19that they are different from each other,
  • 27:22but it's very difficult to
  • 27:24understand the difference.
  • 27:25So the difference is in the size
  • 27:28of the slices of the three pies,
  • 27:31but it's very difficult,
  • 27:32for example, to understand in each
  • 27:35pie which one is the largest slice
  • 27:38and to draw comparisons. It's much
  • 27:41easier to understand these issues,
  • 27:44so which slice is larger, if the information is
  • 27:48displayed not as pie charts
  • 27:51but as bar charts.
  • 27:54So on the web
  • 27:56I also found this provocative
  • 27:59label of pie charts as liars.
  • 28:03So in general it would be better to avoid
  • 28:06displaying information as pie charts
  • 28:08and prefer a bar chart instead
  • 28:10to show the same information.
  • 28:14OK, so that was faster.
  • 28:16This is another example.
  • 28:18What could be wrong with this plot?
  • 28:20Again, we have a treatment.
  • 28:21We have a control.
  • 28:22This time we see the data points.
  • 28:26The scale is wrong, so it obscures
  • 28:29the distribution at the lower end.
  • 28:31Yes, exactly, so this is a case where
  • 28:34most of the data are compressed,
  • 28:37since they have very different magnitudes.
  • 28:39Most of the data are compressed
  • 28:42in a very small part of the
  • 28:46plot and we cannot understand
  • 28:48very well how they are distributed,
  • 28:51because most of the plot is dedicated
  • 28:53to these two outliers.
  • 28:56So this is an issue with measures
  • 28:58that have different magnitudes.
  • 29:00In my experience it happens,
  • 29:03for example, in gene expression measurements,
  • 29:07because they can vary,
  • 29:08especially with sequencing,
  • 29:10by four to five orders of
  • 29:13magnitude, and the main way to solve this
  • 29:16issue is to log-transform the data.
  • 29:19So instead of plotting on a
  • 29:21linear scale, you log-transform
  • 29:24the scale of the data, and this
  • 29:26allows you to shrink the distance
  • 29:28to these two points,
  • 29:30the outliers,
  • 29:31and allows you to see also the
  • 29:34distribution of the points
  • 29:36that here seem all compressed.
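A rough way to quantify the compression described above is to ask what fraction of the axis the bulk of the data occupies before and after a log transform. The values below are hypothetical, with six ordinary measurements and two outliers several orders of magnitude larger.

```python
import math

# Hypothetical measurements: the last two values are the outliers.
values = [12, 15, 20, 25, 30, 40, 50_000, 120_000]
bulk = values[:6]


def axis_share(points, all_points):
    """Fraction of the full axis range that `points` span."""
    return (max(points) - min(points)) / (max(all_points) - min(all_points))


linear_share = axis_share(bulk, values)
log_values = [math.log10(v) for v in values]
log_share = axis_share(log_values[:6], log_values)

print(f"linear scale: bulk spans {linear_share:.4%} of the axis")
print(f"log10 scale:  bulk spans {log_share:.1%} of the axis")
```

On the linear axis the six ordinary points are squeezed into a sliver of the plot, while on the log axis they take up a visible portion and their distribution can be read.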
  • 29:40So usually log transformation allows you to
  • 29:43capture some information on the differences
  • 29:45between your points more clearly,
  • 29:48not in all cases, but in some
  • 29:50cases, than displaying the
  • 29:52information on a linear scale,
  • 29:55especially when you have a large range
  • 29:58between your minimal and maximal
  • 30:00measurements. An alternative way
  • 30:04is also panel breaks.
  • 30:07Personally I prefer log
  • 30:10transformation over panel breaks, because
  • 30:13it is mathematically more
  • 30:16elegant,
  • 30:18but there are situations where
  • 30:20you can choose. So this is an example.
  • 30:24You have a bar chart.
  • 30:26You have a huge difference between
  • 30:28the measurements of A to D and E and F.
  • 30:31So this is how you solve the problem by
  • 30:35introducing a break in your panel,
  • 30:38so from 25 to 200 to 210, and this is the
  • 30:41equivalent solution by log transformation.
  • 30:44As you see,
  • 30:46the two solutions
  • 30:49give a similar result.
  • 30:51But here you insert a manual break in
  • 30:54the data, and this could be misleading.
  • 30:57Here you solve the issue by log
  • 31:00transforming all the measurements.
  • 31:02So this, for example, is an advantage,
  • 31:04because it affects all the measurements,
  • 31:07while this panel break
  • 31:09affects only, for example,
  • 31:11these two bars and could distort the data.
  • 31:18Another scenario where you should
  • 31:22consider log transformation is this.
  • 31:26This could be a plot that shows gene
  • 31:30expression levels from RNA-seq for a
  • 31:33population of genes.
  • 31:35So each gene could be a dot,
  • 31:37and you see that
  • 31:39there is a different range,
  • 31:41but you see that there are outliers, like
  • 31:44for example genes of ribosomal proteins;
  • 31:46histones usually
  • 31:49are in this part of the plot,
  • 31:52but most of the genes, 90% of
  • 31:54your genes, are in this part of the
  • 31:56plot, and you cannot really see,
  • 32:01you cannot really inspect them,
  • 32:03because most of the plot is
  • 32:05dedicated to some outliers.
  • 32:07So again, here is a situation
  • 32:09where you can log-transform
  • 32:11both the coordinates.
  • 32:12So let's say that here is the
  • 32:15control and this is the treatment,
  • 32:17and this will allow you to see in more
  • 32:20detail the differences in the
  • 32:22expression of the bulk of your genes.
  • 32:28In a situation like this,
  • 32:30you should also consider
  • 32:32this: if you're interested,
  • 32:34for example, in showing differences in
  • 32:37expression between a control,
  • 32:39for example R1, and
  • 32:41the treatment, R2,
  • 32:43you also have the possibility to show,
  • 32:48as the Y axis,
  • 32:50the differences in the log values.
  • 32:52So this representation
  • 32:53here is the same as this,
  • 32:56but it maximizes the visualization
  • 32:58of the differences in the
  • 33:00expression levels of genes.
  • 33:01So this is something that you
  • 33:04find called an MA plot.
  • 33:07It was introduced with the
  • 33:09analysis of microarray data,
  • 33:11but you can find it also
  • 33:13with sequencing data.
  • 33:15These two different
  • 33:16visualizations are used
  • 33:17depending on the aim of the figure,
  • 33:20so sometimes you will find this one,
  • 33:21especially when the message of the
  • 33:23figure is that you don't see big
  • 33:25differences between the two conditions,
  • 33:27while if the message is that you find big
  • 33:29differences between the two conditions,
  • 33:31you will mostly find this visualization.
  • 33:35So here I would just point out that
  • 33:37in any sequencing experiment,
  • 33:40you will probably never find any gene
  • 33:44that is in this area, because
  • 33:48for most of the genes
  • 33:50the main difference,
  • 33:51the main determinant,
  • 33:53is the basal expression level.
  • 33:55So usually your perturbations do not
  • 33:57affect the expression of a gene so much
  • 34:00that the gene ends up in these
  • 34:03areas of the plot, the orange ones.
  • 34:06And that's why this visualization
  • 34:08is much more efficient in capturing
  • 34:10the expression differences:
  • 34:12because they are scaled on the
  • 34:15expression at baseline.
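As a small illustration of the MA representation described above, the sketch below computes the two coordinates for a handful of genes. The gene names and expression values are invented for the example. M is the difference of the log values and A is their average, which is the usual convention for MA plots; log base 2 is an assumption here, since the talk does not state the base.

```python
import math

# Hypothetical expression values (e.g. normalized counts) for a control (R1)
# and a treatment (R2); gene names and numbers are made up for illustration.
control = {"geneA": 10.0, "geneB": 200.0, "geneC": 5000.0, "geneD": 40.0}
treatment = {"geneA": 20.0, "geneB": 210.0, "geneC": 4800.0, "geneD": 10.0}

def ma_coordinates(r1, r2):
    """Return (A, M): A = mean log2 expression, M = log2(r2) - log2(r1)."""
    a = 0.5 * (math.log2(r1) + math.log2(r2))
    m = math.log2(r2) - math.log2(r1)
    return a, m

ma = {gene: ma_coordinates(control[gene], treatment[gene]) for gene in control}

for gene, (a, m) in ma.items():
    # A goes on the horizontal axis, M on the vertical axis.
    print(f"{gene}: A={a:.2f}  M={m:+.2f}")
```

Values of zero cannot be log transformed, so real data usually need filtering or a small pseudocount first; that step is left out of this sketch.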
  • 34:19OK, so now I have a section, I don't
  • 34:21know if I have time, a section
  • 34:24about how to display distributions.
  • 34:29So let's say that
  • 34:29we have... On time, you have 15
  • 34:32minutes, and if we go a little over, that's
  • 34:34OK. OK. So when you have to
  • 34:37represent the distribution of data,
  • 34:39you have many choices.
  • 34:40The histogram is one of the most used choices.
  • 34:44It has the advantage that it can present
  • 34:48in detail the shape of the
  • 34:51distribution of your data.
  • 34:53So basically you have a variable of
  • 34:56interest, which usually is a continuous
  • 34:58variable, and you want to show
  • 35:01how this variable is distributed.
  • 35:03So you divide the range of the values into
  • 35:06some bins and then you count the number
  • 35:09of points that fall inside each bin.
  • 35:12The issue with histograms is
  • 35:14that you should be careful when
  • 35:17building them and when looking
  • 35:19at them, because there are
  • 35:22some arbitrary parameters
  • 35:24in building a histogram,
  • 35:26mainly the choice of the bin size.
  • 35:30So this is an example where the same
  • 35:33distribution of data, that is the
  • 35:35distribution of the price of Airbnb
  • 35:38apartments in a French city, has been
  • 35:40binned in two different ways.
  • 35:43So here is the price, and here the bin size,
  • 35:47so the size of each bin, is 10
  • 35:52dollars,
  • 35:53while here on the right it is
  • 35:57$2. So you can see that using more
  • 36:00granular bins allows you to see
  • 36:04the presence of some accumulations
  • 36:06in your data that you cannot really
  • 36:09see with the larger bin size,
  • 36:11and this could be important because
  • 36:14these accumulations probably
  • 36:16are accumulations of prices that are
  • 36:19due to the fact that they are prices
  • 36:22that are commonly used
  • 36:23by many different Airbnbs, for example
  • 36:26because they are multiples of 50 or 100,
  • 36:29for example.
  • 36:30But the fact is that depending on
  • 36:33the choice of the bin,
  • 36:34you see a different story.
  • 36:38So you should always be
  • 36:42careful to select a bin size
  • 36:45that doesn't distort the data too much.
  • 36:49There are also software tools
  • 36:51that calculate, depending on
  • 36:53your data, depending on where
  • 36:56your points are placed, the best
  • 36:58size of the bins so that you
  • 37:02reduce the distortion of your data.
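The effect of the bin size can be reproduced with a few lines of Python. This is only a sketch: the prices are simulated (a smooth spread plus pile-ups at round values), not the Airbnb data from the slide. The Freedman-Diaconis rule is shown as one example of an automatic bin-width rule; the talk does not say which rule any particular tool uses.

```python
import random
import statistics
from collections import Counter

random.seed(0)

# Hypothetical prices: a smooth spread plus pile-ups at "round" prices.
prices = [random.uniform(20, 200) for _ in range(500)]
prices += [50.0] * 40 + [100.0] * 60 + [150.0] * 30

def histogram(data, bin_width):
    """Count points per bin; each bin is labelled by its left edge."""
    counts = Counter(int(x // bin_width) * bin_width for x in data)
    return dict(sorted(counts.items()))

coarse = histogram(prices, 10)  # bin size 10: pile-ups are diluted
fine = histogram(prices, 2)     # bin size 2: pile-ups stand out

def freedman_diaconis_width(data):
    """Freedman-Diaconis rule: width = 2 * IQR / n^(1/3)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return 2 * (q3 - q1) / len(data) ** (1 / 3)

print("count in the bin starting at 100 (width 10):", coarse[100])
print("count in the bin starting at 100 (width 2): ", fine[100])
print("suggested bin width:", round(freedman_diaconis_width(prices), 1))
```

Comparing the count in the bin that contains 100 with the typical count of the other bins shows how much more the pile-up stands out with the narrow bins.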
  • 37:09An alternative way to represent
  • 37:10distribution is to use a density plot.
  • 37:13So a density plot is basically
  • 37:15a smoothing of a histogram.
  • 37:18Here you collect bins, and here
  • 37:21you smooth the shape of the distribution
  • 37:23so that you have a continuous function.
  • 37:27This is graphically nice,
  • 37:30and it allows you to compare,
  • 37:31for example, the distributions of
  • 37:33two variables, as you see here
  • 37:35in green and in violet,
  • 37:37and the advantage is that you can also see
  • 37:40complex shapes of the distribution,
  • 37:42for example here the bimodality,
  • 37:44or here the presence of this shoulder
  • 37:46of the distribution.
  • 37:48The pitfall is similar to the histogram,
  • 37:51so you should always be careful
  • 37:53in selecting
  • 37:54how much to smooth the distribution.
  • 37:56And here you see an example.
  • 37:58So these are the
  • 37:59points, the single
  • 38:02points that were used
  • 38:04in order to build the distribution.
  • 38:07They were randomly chosen from a normal
  • 38:10distribution, and you can see that the
  • 38:12problem is similar to the bin size,
  • 38:14so here you have to select basically
  • 38:19a bandwidth in order to approximate
  • 38:21the function with a curve,
  • 38:23and depending on the bandwidth,
  • 38:26the resolution
  • 38:27that you choose,
  • 38:29the result is different.
  • 38:31So you could have this kind of plot
  • 38:34that seems to show a lot of local peaks,
  • 38:38but by smoothing more you have
  • 38:41instead the normal distribution
  • 38:43from which you drew the data. So
  • 38:46there is a balance here: choosing
  • 38:48bins that are too
  • 38:51large, or here excessive smoothing,
  • 38:53oversimplifies
  • 38:54the original distribution,
  • 38:55but on the other side,
  • 38:57if you take a resolution that is too small,
  • 39:01too granular,
  • 39:02you can obtain strange effects.
  • 39:04So you could see, for example,
  • 39:06peaks that depend only on the
  • 39:09extraction of random numbers.
  • 39:11Again,
  • 39:11also in this case there are software tools
  • 39:15that, given the original data,
  • 39:18your original raw data, can calculate the
  • 39:22optimal smoothing bandwidth in order
  • 39:26to avoid distortions based on your data.
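The smoothing trade-off can be sketched with a hand-written Gaussian kernel density estimate. Everything below is illustrative: the sample is simulated from a normal distribution, the function names are made up for this example, and Silverman's rule of thumb stands in for "software that calculates the smoothing width from the data", since the talk does not name a specific method.

```python
import math
import random
import statistics

random.seed(1)
sample = [random.gauss(0.0, 1.0) for _ in range(50)]  # drawn from a normal

def gaussian_kde(data, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return density

def count_peaks(density, lo=-4.0, hi=4.0, steps=400):
    """Count local maxima of the curve on an evenly spaced grid."""
    ys = [density(lo + (hi - lo) * i / steps) for i in range(steps + 1)]
    return sum(1 for i in range(1, steps) if ys[i - 1] < ys[i] > ys[i + 1])

def silverman_bandwidth(data):
    """Silverman's rule of thumb: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    spread = min(statistics.stdev(data), (q3 - q1) / 1.34)
    return 0.9 * spread * len(data) ** (-0.2)

narrow_peaks = count_peaks(gaussian_kde(sample, 0.05))  # too little smoothing
auto_bw = silverman_bandwidth(sample)
smooth_peaks = count_peaks(gaussian_kde(sample, auto_bw))

print("local peaks with a very small smoothing width:", narrow_peaks)
print("rule-of-thumb smoothing width:", round(auto_bw, 2))
print("local peaks with the rule-of-thumb width:", smooth_peaks)
```

With the very small width, the curve shows many local peaks that only reflect where the random points happened to fall; the rule-of-thumb width removes most of them.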
  • 39:29A compact way to represent the
  • 39:31distribution is the box whisker plot,
  • 39:34and here you can see how a box
  • 39:36whisker plot can be obtained
  • 39:39from this distribution of 20 points.
  • 39:41So basically the box whisker plot
  • 39:43represents as a box 50% of the data
  • 39:46of the distribution.
  • 39:48Usually you have a central line
  • 39:50that is the median.
  • 39:51It's important:
  • 39:52not the mean,
  • 39:53in the box whisker it is always the median.
  • 39:56This is the first quartile
  • 39:58and the third quartile,
  • 40:00so the 25th percentile of the data and the
  • 40:0375th percentile of the data.
  • 40:05So in the box you have 50%
  • 40:07of your data, the central
  • 40:0950% of your data.
  • 40:11Then you have the whiskers.
  • 40:14The standard definition of the
  • 40:17whisker length is that they are as
  • 40:20long as the interquartile range,
  • 40:22that is the distance between Q1 and Q3, times 1.5.
  • 40:27And you see these as the whiskers
  • 40:30of your plot.
  • 40:32So these collect most of the
  • 40:34distribution of your data.
  • 40:36The data that are outside the whiskers
  • 40:38are considered to be outliers.
  • 40:40For example,
  • 40:41here you see these three points.
  • 40:44They are outside the whisker size,
  • 40:46and so these usually are individually
  • 40:49displayed in the whisker plot and are
  • 40:52considered to be outliers according
  • 40:54to this definition of the whiskers.
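The anatomy described above can be written down as a short calculation. This is a sketch under stated assumptions: the 20 data points are invented, the function name `box_stats` is hypothetical, and the quartiles use the "inclusive" method of Python's `statistics.quantiles`; plotting tools differ slightly in how they compute quartiles.

```python
import statistics

def box_stats(data, whisker_factor=1.5):
    """Compute the elements of a box-and-whisker plot (1.5 * IQR whiskers)."""
    data = sorted(data)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker_factor * iqr
    high_fence = q3 + whisker_factor * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        # Each whisker stops at the most extreme data point inside its fence,
        # so the two whiskers usually have different lengths.
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }

# Hypothetical sample of 20 points with three large values.
points = [2, 3, 3, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 8, 9, 10, 19, 21, 24]
stats = box_stats(points)

for key, value in stats.items():
    print(f"{key}: {value}")
```

In this made-up sample the three largest values fall beyond the upper fence and are reported as outliers, and the lower whisker ends at the minimum of the data rather than at the full 1.5 times IQR distance.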
  • 40:57Yes,
  • 40:58if you wanted
  • 40:59to make these plots, yeah,
  • 41:00is there an easy way to do it,
  • 41:03or do you, like you personally,
  • 41:05just do it in R or something?
  • 41:08Well, box plots, I don't think
  • 41:10you can do them with Excel,
  • 41:13but for example with Prism
  • 41:15or Origin you can, totally.
  • 41:20I think the only limitation is
  • 41:22Excel, but to be honest, I didn't
  • 41:24check the latest version of Excel.
  • 41:27Right, for us to think about:
  • 41:28you know, we have our data and there
  • 41:31are many different ways of plotting it,
  • 41:33but it sounds like Prism might be the
  • 41:35way to go to try to do it, unless
  • 41:37you're somebody like you
  • 41:38who knows how to put it into R.
  • 41:41Yes, probably. So Prism
  • 41:45gives you an option that is much
  • 41:47used. I usually use
  • 41:48Origin. With respect to Prism,
  • 41:50I think it has more,
  • 41:52more power,
  • 41:54so there are more things that you
  • 41:56can do with Origin than with Prism,
  • 41:59I think because it was designed
  • 42:01for the physics community,
  • 42:04but the tradeoff is always complexity,
  • 42:06so Prism has less power,
  • 42:09fewer choices, but it's easier
  • 42:10to use than Origin.
  • 42:13But both share the same philosophy,
  • 42:15so you need to provide the data
  • 42:18in a spreadsheet format, and they are
  • 42:21available in the software library at.
  • 42:24OK, thank you. Can you say
  • 42:27the name of the other, not Prism
  • 42:29but the other program?
  • 42:31Origin. I have a slide afterwards where I show
  • 42:34it. Origin. OK, thanks, yeah.
  • 42:37I have a question. So,
  • 42:38my initial understanding is that
  • 42:40the whisker length is representing the
  • 42:4295th percentile of the data range.
  • 42:45But here it says the whisker
  • 42:47length is 1.5 times this IQR length.
  • 42:50But if that's the case,
  • 42:52why would the left side whisker
  • 42:54and the right side whisker
  • 42:57have different lengths?
  • 43:02Um, so that could be, for example,
  • 43:06because here you have the... so that's
  • 43:09the maximal length of the whisker.
  • 43:12But if the minimum of your
  • 43:14data is here,
  • 43:17the whisker stops. So that's why.
  • 43:19So you see, here you have outliers, and
  • 43:22so the whisker can extend to the maximum
  • 43:25point, that is 1.5 times this measure.
  • 43:27But if, before the maximal distance,
  • 43:30here you meet the minimal point,
  • 43:32the whisker ends there,
  • 43:34so that's why.
  • 43:36OK, I see. It's also true that this
  • 43:39whisker definition can be customized,
  • 43:41so this is the default interpretation.
  • 43:43I don't know who decided this.
  • 43:46I don't have the original publication,
  • 43:48but you can choose to define the whiskers
  • 43:50differently, so that's why,
  • 43:52also in a journal
  • 43:54paper, when you do a box plot,
  • 43:56you always have to specify in the statistical
  • 43:59methods how you designed your box plot.
  • 44:02So you have to state how,
  • 44:04for example, the whiskers were defined.
  • 44:07Because sometimes it's true that,
  • 44:08for example,
  • 44:09the whiskers can represent like
  • 44:1195% of the distribution.
  • 44:13Right, so this is just the default,
  • 44:15but it can be customized,
  • 44:17so there are different choices.
  • 44:21I have a question regarding the
  • 44:24distribution again, maybe it's in
  • 44:26continuation to what you just said.
  • 44:30Some software provides a default value
  • 44:32for the bin size and for the smoothing
  • 44:36and all that, say like MATLAB, which
  • 44:38I've been trying to put this into.
  • 44:41How reliable do you think the
  • 44:44default values are, and what would you suggest?
  • 44:48Most of the time,
  • 44:49most of the time... so
  • 44:52I don't have experience with MATLAB,
  • 44:53but probably it's the
  • 44:56same as in R. So most of the time
  • 44:58there is a sort of optimization there,
  • 45:01so most of the time it is fine. But
  • 45:07sometimes, especially if you
  • 45:09have a distribution of data
  • 45:11but you also have a point
  • 45:14with an accumulation of data,
  • 45:16you could have problems in the
  • 45:20plot.
  • 45:22But I don't have an example.
  • 45:25OK, so like 95% of the time I'm OK
  • 45:30with the solution that is
  • 45:33provided by the MATLAB or R built-in tool.
  • 45:38For example, sometimes when you compare
  • 45:40two distributions with a different size,
  • 45:42with a different number of points,
  • 45:44that can be a problem.
  • 45:48Because sometimes,
  • 45:50if you're comparing, for example, a
  • 45:52distribution of 10 points with
  • 45:54a distribution of 1000 points,
  • 45:56adopting the same bandwidth
  • 45:58could be a problem,
  • 45:59and so you need to change it manually.
  • 46:03So that's, yes,
  • 46:05that probably could be
  • 46:08a practical example of when it's not ideal.
  • 46:12Because the software,
  • 46:13if you are trying to compare
  • 46:1610 points versus 1000 points,
  • 46:18tries to define a common bandwidth,
  • 46:21but sometimes this leads
  • 46:23to distorted images.
  • 46:25I don't have an example to show.
  • 46:29That's good enough, thank you.
  • 46:32And well, I can leave a note.
  • 46:35Sometimes you can also see the
  • 46:37notch inside your box whisker,
  • 46:39so the notch is this
  • 46:42feature that represents a measure
  • 46:44of uncertainty for the median.
  • 46:47So sometimes it is useful to have
  • 46:49this, because if you are comparing
  • 46:51a lot of box whisker plots, you can
  • 46:54look at the uncertainty as if it was a
  • 46:56sort of standard error of the median.
  • 46:59And so if two box whiskers overlap in
  • 47:02their notches,
  • 47:03it probably means that the
  • 47:05medians are not statistically
  • 47:08significantly different.
  • 47:09This could be a way to use the notch:
  • 47:12it eases
  • 47:15the interpretation of the data
  • 47:16and the comparison of different
  • 47:18distributions, and that's why
  • 47:19box whisker plots are so popular,
  • 47:21because they allow you to represent
  • 47:24the distribution of data in
  • 47:25a very compact format.
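The talk does not give a formula for the notch. One common convention, assumed here, places the notch at the median plus or minus about 1.57 times the IQR divided by the square root of the number of points; plotting tools use slightly different constants. The three groups below are invented data and the function names are hypothetical.

```python
import math
import statistics

def median_notch(data, factor=1.57):
    """Approximate notch limits: median +/- factor * IQR / sqrt(n)."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    half_width = factor * (q3 - q1) / math.sqrt(len(data))
    return median - half_width, median + half_width

def notches_overlap(a, b):
    """True when the notch intervals of the two samples share some range."""
    lo_a, hi_a = median_notch(a)
    lo_b, hi_b = median_notch(b)
    return lo_a <= hi_b and lo_b <= hi_a

# Hypothetical measurements for three groups.
group_a = [4.1, 4.8, 5.0, 5.2, 5.5, 5.9, 6.1, 6.3, 6.8, 7.0]
group_b = [4.5, 4.9, 5.3, 5.4, 5.8, 6.0, 6.2, 6.6, 6.9, 7.4]
group_c = [9.0, 9.4, 9.9, 10.1, 10.3, 10.8, 11.0, 11.2, 11.7, 12.1]

print("A vs B notches overlap:", notches_overlap(group_a, group_b))
print("A vs C notches overlap:", notches_overlap(group_a, group_c))
```

Overlapping notches are a visual hint, not a formal test; a proper statistical comparison is still needed before claiming a difference between medians.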
  • 47:27This is another display of the
  • 47:29anatomy of the box whisker,
  • 47:31but it doesn't add anything
  • 47:33to what I had before.
  • 47:35So here is an example where box whisker
  • 47:38plots are used in order to compare
  • 47:42four different distributions.
  • 47:43So the advantage is that they allow
  • 47:46easy comparison, so it's easy to
  • 47:49compare the distributions of A, B, C and D.
  • 47:52The problem they can have is that they
  • 47:55hide the shape of the distribution.
  • 47:58And also, usually they hide the
  • 48:01number of points that were used
  • 48:03to build the box whisker.
  • 48:05Sometimes you can encode the number
  • 48:08of points, so the cardinality, the
  • 48:10size of the distribution, as the
  • 48:13width of the box whisker,
  • 48:15but it's rarely used because it's not
  • 48:18very visually beautiful, I would say.
  • 48:21So one solution could be to
  • 48:23overlay on the box whisker
  • 48:26plot a jitter plot.
  • 48:27A jitter plot represents the single
  • 48:30points that were used to build
  • 48:32the box whisker plot,
  • 48:34so while on the Y axis there
  • 48:36are the precise values, on the X
  • 48:39axis they are randomly
  • 48:43placed. Let's say there are
  • 48:45also methods that do not place
  • 48:47these points randomly but in a
  • 48:49pseudo-random way that captures
  • 48:50the shape of the distribution,
  • 48:52and I think that that kind of plot
  • 48:55is also present in GraphPad
  • 48:58Prism. So the advantage of this
  • 49:00is that you can see, for example,
  • 49:03that this distribution is bimodal,
  • 49:06because you see that there are
  • 49:08these high densities of points, and
  • 49:10the box whisker plot cannot capture
  • 49:12that: you cannot see from a box
  • 49:14whisker plot that a distribution is
  • 49:17bimodal. And for example here you
  • 49:19can see that this box whisker
  • 49:21plot is based on
  • 49:24much less data than the others.
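The jitter idea amounts to keeping the exact value on the Y axis and adding a small random horizontal offset around the group's position. A minimal sketch follows; the function name, the offset size and the data are all assumptions made for this example.

```python
import random

def jitter_positions(values, center, max_offset=0.2, seed=0):
    """Pair each value with a random x position near its group's center."""
    rng = random.Random(seed)  # fixed seed so the layout is reproducible
    return [(center + rng.uniform(-max_offset, max_offset), v) for v in values]

# Hypothetical bimodal group: values cluster around 3 and around 6.
group = [3.1, 3.3, 3.2, 5.9, 6.1, 6.0, 3.0, 6.2]
points = jitter_positions(group, center=1.0)

for x, y in points:
    # y is untouched; only x is randomized to reduce overplotting.
    print(f"x={x:.3f}  y={y}")
```

Because only the horizontal position is randomized, the two clusters of values remain visible, which a box alone would hide.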
  • 49:26So a solution for this
  • 49:29is to enclose the box whisker
  • 49:32plot in a violin plot.
  • 49:35So a violin plot representation
  • 49:37like this allows you to see the
  • 49:40same information as a box whisker,
  • 49:43but also information on the shape
  • 49:46of the distribution. Basically,
  • 49:49in a violin plot
  • 49:50you add a density plot that is
  • 49:54parallel to the vertical axis.
  • 49:58And here, by using a violin plot, you can see
  • 50:00that this,
  • 50:01that this distribution
  • 50:03has one peak. This one is bimodal.
  • 50:08And you can also add the number here,
  • 50:11instead of encoding the number,
  • 50:13the size of the distribution, as
  • 50:15the width of the distribution.
  • 50:19So this is an example
  • 50:21of comparisons between different
  • 50:23ways to show a distribution.
  • 50:25Here you see the histogram with the
  • 50:28corresponding density plot,
  • 50:29the same distribution
  • 50:31visualized as a box plot,
  • 50:33and visualized as a violin plot that
  • 50:36captures both the features of a box
  • 50:39plot plus the density distribution.
  • 50:42And this is for a normal distribution.
  • 50:44This is for a bimodal distribution, where you
  • 50:47can see that the box plot doesn't capture it.
  • 50:50So the box plot can capture the fact
  • 50:53that the data are not symmetrical:
  • 50:55you see, for example, that the distance
  • 50:58from this edge of the box
  • 51:01to the median is much more than this one.
  • 51:04So the box whisker is good at capturing
  • 51:07asymmetrical distributions but not
  • 51:09the presence of more than one peak,
  • 51:11so not the complex shape of the distribution.
  • 51:15And there is a website here where
  • 51:17you can see a lot of
  • 51:21examples where a different choice of
  • 51:23visualization can lead to a different
  • 51:26conclusion, as here.
  • 51:29It's also true that the violin plot
  • 51:32is not efficient, because
  • 51:35you're showing the same
  • 51:37information twice.
  • 51:39So this is aesthetically pleasant,
  • 51:42but it is not efficient because
  • 51:44you're basically repeating this
  • 51:46density twice, above and below,
  • 51:49and so that's why,
  • 51:52for the sake of efficiency,
  • 51:55there are recent visualization
  • 51:57strategies such as the raincloud plot.
  • 52:00So the raincloud plot shows a box
  • 52:03whisker plot in the middle, half a violin
  • 52:06plot here, and then also the single points.
  • 52:09So that's probably one of the
  • 52:12most complete, exhaustive ways to
  • 52:14represent a distribution of data.
  • 52:16And it is called raincloud because
  • 52:18of this effect: this should be the cloud,
  • 52:20and this is the rain that falls
  • 52:22below it.
  • 52:23So you can find information on
  • 52:25how to plot these by
  • 52:28following this link.
  • 52:29Another yeah.
  • 52:31Quick question, is there a,
  • 52:35how to say, restriction or limitation
  • 52:38as to how many data points are required
  • 52:43for generating a reliable violin plot?
  • 52:50Generally not. Probably more than 10,
  • 52:55I would say, because otherwise you can
  • 52:58see it empirically:
  • 53:00if the data are too few, you can
  • 53:03see that the violin basically has a sort
  • 53:06of wave around each point of your data.
  • 53:10So as a general threshold
  • 53:15I would say 10 points would be
  • 53:19the minimum number. And asking
  • 53:21that question is because, of course, if
  • 53:23you have a lot of data points,
  • 53:26these would be informative.
  • 53:27But if you have, let's say,
  • 53:29less than 10 or a small number,
  • 53:31this could be really distorting
  • 53:33or faking the. Yeah, yeah, that's
  • 53:35true. That's why I would
  • 53:37say 10, because below 10
  • 53:39probably the best strategy is to show
  • 53:41the single points and then a summary
  • 53:44such as the mean or median plus
  • 53:46some variation such as the standard deviation,
  • 53:48but not the distribution
  • 53:50as a violin plot.
  • 53:51So that's for less than 10 data points.
  • 53:57Alright, when the data are too many,
  • 53:59for example, it doesn't make
  • 54:00sense to show the single points,
  • 54:03because
  • 54:04they overlap each other and so you
  • 54:06don't see anything. That happens
  • 54:08when you have more than 1000 points,
  • 54:10and so the best solution in that case
  • 54:12is, for example, to show only the violin.
  • 54:18So there is a range for which
  • 54:22the best solution is to show the
  • 54:24single data points with a crossbar,
  • 54:27so an element which captures the mean or
  • 54:30median plus the standard deviation or
  • 54:33confidence interval. There is a
  • 54:35range in the middle, from 10
  • 54:38to some hundreds, where the violin
  • 54:40plot and the box whisker plot are
  • 54:43the best options to visualize.
  • 54:44And when you have many,
  • 54:47many data points, more than 1000, probably,
  • 54:48if you want to capture the distribution,
  • 54:51then the violin plot alone, rather
  • 54:53than the single points, is the best way.
  • 55:01Did it answer?
  • 55:06Yeah, that was awesome.
  • 55:08That is a great explanation.
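The rule of thumb from this answer can be summarized as a small helper. The function name is hypothetical and the thresholds are the approximate ones mentioned in the discussion (about 10 and about 1000 points). The talk leaves the range between "some hundreds" and 1000 unspecified, so treating it as part of the middle range is an assumption of this sketch.

```python
def suggest_distribution_plot(n_points, few=10, many=1000):
    """Suggest a representation from the sample size (approximate thresholds)."""
    if n_points < few:
        # Too few points for a reliable density estimate.
        return "single points + summary (mean or median with SD or CI)"
    if n_points <= many:
        return "box plot or violin plot (optionally with single points)"
    # With very many points the individual dots overlap and add nothing.
    return "violin plot only"

for n in (6, 150, 5000):
    print(n, "->", suggest_distribution_plot(n))
```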
  • 55:11OK, another alternative
  • 55:13way to maximize the efficiency of the
  • 55:16violins, which I saw a lot
  • 55:18with single cell data, for example,
  • 55:20is the use of split violin plots,
  • 55:23so you use the violin plot to show a
  • 55:26comparison between two distributions.
  • 55:29So you see here, this plot shows the
  • 55:32distribution of the age of females and
  • 55:34males using different social
  • 55:37media: Instagram, Facebook, Twitter.
  • 55:38So it's a way to show, using
  • 55:41half of a violin plot, differences
  • 55:44in the distributions, and this can
  • 55:46be used when you have a contrast
  • 55:49of two conditions or you want to
  • 55:52compare two distributions,
  • 55:53also in single cell.
  • 55:57About the violin plots,
  • 55:58like in such cases, yeah,
  • 56:00so what determines the height of the peaks?
  • 56:03Is everything normalized
  • 56:05so that the total area is the same,
  • 56:07or so that the maximum height is the same?
  • 56:10So most of the time,
  • 56:12so you usually have choices, so you can
  • 56:16choose to have the same maximum height.
  • 56:20And
  • 56:22that's usually what you find,
  • 56:24so you plot them in a way
  • 56:27that the range is the same, from
  • 56:30here to here and from here to here.
  • 56:33The alternative is to use the
  • 56:35real criterion for a density,
  • 56:38and that should be that the
  • 56:41area under this is equal to 1,
  • 56:45so that the two have the same area.
  • 56:48Another alternative is to have an
  • 56:50area that is proportional to
  • 56:52the number of observations,
  • 56:54but I think that visually, most
  • 56:57of the time, you find that
  • 57:00the criterion, in order
  • 57:02to have balanced plots,
  • 57:05is to have the same range,
  • 57:07meaning from here to the maximum,
  • 57:09for all the plots,
  • 57:11independently of the area and
  • 57:13independently of the number of points.
  • 57:18It's probably not the best solution from
  • 57:20the point of view of communication,
  • 57:21but it's the most used. OK, thank you.
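The three scaling choices mentioned in this answer (same maximum width, same area, area proportional to the number of observations) can be sketched as a rescaling of density values sampled on an even grid. The function name, the mode names and the two example curves are assumptions made for illustration; plotting libraries expose similar options under their own names.

```python
def scale_violin(density, n_points, mode="width", max_half_width=0.4):
    """Rescale density values (sampled on an even grid) into violin half-widths."""
    total = sum(density)  # proportional to the area on an even grid
    if mode == "width":    # every violin reaches the same maximum width
        factor = max_half_width / max(density)
    elif mode == "area":   # every violin encloses the same area
        factor = 1.0 / total
    elif mode == "count":  # area proportional to the number of observations
        factor = n_points / total
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [d * factor for d in density]

# Hypothetical density curves: one narrow and peaked, one broad and flat.
peaked = [0.0, 1.0, 4.0, 1.0, 0.0]
flat = [1.0, 2.0, 2.0, 2.0, 1.0]

same_width = [scale_violin(d, n, "width") for d, n in ((peaked, 30), (flat, 300))]
same_area = [scale_violin(d, n, "area") for d, n in ((peaked, 30), (flat, 300))]
by_count = [scale_violin(d, n, "count") for d, n in ((peaked, 30), (flat, 300))]

print("max half-widths (width mode):", [max(v) for v in same_width])
print("areas (area mode):", [round(sum(v), 3) for v in same_area])
print("areas (count mode):", [round(sum(v), 1) for v in by_count])
```

The same-width mode gives visually balanced violins, at the cost of hiding differences in sample size, which matches the trade-off described in the answer.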
  • 57:27A variation of this is also the
  • 57:30use of ridgeline plots.
  • 57:32They allow you to compare a
  • 57:35lot of different densities.
  • 57:37For example, here you see a comparison
  • 57:40of the density of temperatures in
  • 57:43different months in one location,
  • 57:45Lincoln, Nebraska, and this
  • 57:48is used
  • 57:52a lot now, in these years,
  • 57:54with single cell data.
  • 57:55For example,
  • 57:56here you see that it is used to
  • 57:58compare the distribution of the
  • 58:01expression of one gene, LYZ
  • 58:03or CCL5, in different populations
  • 58:05of cells that probably
  • 58:07come from some blood sample,
  • 58:10different populations,
  • 58:11and these allow you to see
  • 58:13marker genes, or to
  • 58:16see how the expression of a gene is
  • 58:19specific for a population of cells.
  • 58:22So that's why I included it, because I
  • 58:25see that the frequency of this plot,
  • 58:28especially in the single cell
  • 58:31visualization field, is increasing a lot.
  • 58:34I have this section of the
  • 58:36presentation that we could skip.
  • 58:38In general the message about this
  • 58:40is that Venn diagrams are good
  • 58:43when you have two sets,
  • 58:45but
  • 58:46if there are more, they are a bad way to
  • 58:49represent intersections between sets.
  • 58:51And this actually is a plot that
  • 58:54was published in Nature, and it's
  • 58:56about a comparison of the genome
  • 58:59of the banana with other species.
  • 59:01So the problem in general is that
  • 59:04when you have more than two sets, three,
  • 59:06four, but also with two,
  • 59:09the traditional Venn diagram
  • 59:11is not the best way to
  • 59:13visualize intersections.
  • 59:15So a table is probably more effective
  • 59:18than this, because the areas are
  • 59:21not proportional to the size,
  • 59:23and it's quite confusing to see
  • 59:26the specific intersections, and
  • 59:28so an alternative way
  • 59:30that was developed in recent years
  • 59:32is the concept of these
  • 59:35UpSet plots, so to represent the
  • 59:37intersections in a matrix format.
  • 59:40So you represent membership
  • 59:42for some object,
  • 59:44for example
  • 59:44a gene that is present only in
  • 59:46list A, only in list D, only in list C,
  • 59:49in the intersection between A and D, or a gene
  • 59:51present in all the lists.
  • 59:53So you can use this matrix format
  • 59:55to show the intersections, and then
  • 59:58you can display the cardinality
  • 01:00:00of each
  • 01:00:01intersection, so the number of genes,
  • 01:00:03for example, that are only in the EGFR
  • 01:00:06pathway, which you see here,
  • 01:00:08or the number of genes that are
  • 01:00:11common between the EGFR and PTEN pathways,
  • 01:00:14which you see here.
  • 01:00:15So this is a way to show the cardinality
  • 01:00:19of the global lists, which you see here,
  • 01:00:22and you can also rank the intersections
  • 01:00:24between the different sets according
  • 01:00:26to their size, to their cardinality.
  • 01:00:29So it's much clearer
  • 01:00:30in showing the structure of the intersections
  • 01:00:34than a Venn diagram.
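The matrix behind an UpSet plot is a table of exclusive intersections: for each combination of lists, the items found in exactly those lists and no others. A minimal sketch of that computation is given below. The set names and gene identifiers are placeholders, not real pathway members, and the function name is made up for this example.

```python
from itertools import combinations

# Hypothetical gene lists; names are placeholders, not real pathway members.
gene_sets = {
    "A": {"g1", "g2", "g3", "g4"},
    "B": {"g3", "g4", "g5"},
    "C": {"g4", "g6"},
}

def exclusive_intersections(sets):
    """Map each combination of set names to the items found in exactly those sets."""
    names = sorted(sets)
    result = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            inside = set.intersection(*(sets[n] for n in combo))
            others = [sets[n] for n in names if n not in combo]
            outside = set().union(*others)
            members = inside - outside
            if members:
                result[combo] = members
    return result

intersections = exclusive_intersections(gene_sets)

# Rank by cardinality, largest first, as in the bar chart on top of the matrix.
ranked = sorted(intersections.items(), key=lambda kv: (-len(kv[1]), kv[0]))
for combo, members in ranked:
    print(" & ".join(combo), "->", len(members), sorted(members))
```

Each item is counted in exactly one row, so the cardinalities add up to the total number of distinct items, which is what makes the ranked bars easy to read compared with overlapping areas.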
  • 01:00:38I skip this because
  • 01:00:41these were some examples of bad
  • 01:00:44usage of graphics in politics.
  • 01:00:47You can find a lot of them online,
  • 01:00:50related to Fox News,
  • 01:00:52of bad usage of data display.
  • 01:00:54So the final part could be how to
  • 01:00:56draw plots. There is...
  • 01:00:58they were trying to
  • 01:01:00not make the point. Yeah,
  • 01:01:02well, they were trying
  • 01:01:04to give a message by distorting the.
  • 01:01:09Yeah, this for example is the
  • 01:01:11issue of whether you always need to
  • 01:01:13include the zero in your plot.
  • 01:01:15So, this is controversial.
  • 01:01:17Let's say that in general,
  • 01:01:19in bar plots leaving it out is a bad idea,
  • 01:01:21but for example it can be a good
  • 01:01:23idea in time series,
  • 01:01:25and that's because in bar plots
  • 01:01:27the height of the bar is
  • 01:01:30the main message of the figure,
  • 01:01:32while for example here, in a
  • 01:01:34time series, the main message
  • 01:01:36is how the two trajectories
  • 01:01:38evolve and are interconnected.
  • 01:01:40So the main issue is the horizontal
  • 01:01:43axis, and so you can skip the zero.
  • 01:01:47So again,
  • 01:01:47it depends on how much the inclusion
  • 01:01:50or exclusion of the zero distorts
  • 01:01:53the main message of the figure.
  • 01:01:56So how to draw plots?
  • 01:01:58So here there is an outline
  • 01:02:00of the software that you have.
  • 01:02:03So this is some commercial
  • 01:02:05software, starting from Excel.
  • 01:02:07It's probably the most used or available,
  • 01:02:10but it doesn't allow you to plot all the
  • 01:02:13solutions that I showed before. But,
  • 01:02:16for example,
  • 01:02:17GraphPad
  • 01:02:18Prism
  • 01:02:18or Origin Pro are two software packages
  • 01:02:21that are available, and with those
  • 01:02:23you should be able, in an
  • 01:02:26environment that is similar to Excel,
  • 01:02:28to produce most of the plots that
  • 01:02:30you saw in the presentation today.
  • 01:02:33So this is commercial software that
  • 01:02:35doesn't require programming
  • 01:02:36skills. On this side
  • 01:02:38you see the main solutions
  • 01:02:40that are used by data scientists,
  • 01:02:42but that require programming
  • 01:02:44skills, so the two most
  • 01:02:46common languages in data science,
  • 01:02:48R and Python. So for R you have the ggplot
  • 01:02:53library; for Python
  • 01:02:54you have matplotlib or seaborn.
  • 01:02:58At these require programming so,
  • 01:02:59but I would say that the advantage
  • 01:03:02nowadays of using visa is that you can
  • 01:03:05find a lot of really a lot of examples.
  • 01:03:09Because there are a lot of website
  • 01:03:12that where you can choose that
  • 01:03:15you're like data visualization
  • 01:03:17type and you see already the code.
  • 01:03:20That you can use in order
  • 01:03:22to produce the blocked.
  • 01:03:23So I would say that you just need
  • 01:03:26to know how to insert or how to
  • 01:03:28load into this programming
  • 01:03:31environment a table of data.
  • 01:03:33And then most of the difficulties
  • 01:03:35are probably in fixing details,
  • 01:03:37so it's very easy to realize a plot,
  • 01:03:40a different plot.
  • 01:03:41It's more complicated to adapt the
  • 01:03:44small things to your taste.
  • 01:03:48So the suggestion is that if you
  • 01:03:50do a lot of visualization,
  • 01:03:52it's worth investing in this.
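As a rough illustration of the workflow described here (load a table of data, draw the plot, then adjust the details), the following is a minimal Python sketch using matplotlib. The table contents, the column names `group` and `value`, and the helper names `load_table` and `draw_boxplot` are all hypothetical; in practice the table would be read from a file.

```python
import csv
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# Hypothetical table of data, kept inline so the example is self-contained.
TABLE = """group,value
control,4.1
control,3.8
control,4.4
treated,5.9
treated,6.3
treated,5.6
"""


def load_table(text):
    """Read CSV text into a dict mapping group name -> list of values."""
    groups = {}
    for row in csv.DictReader(io.StringIO(text)):
        groups.setdefault(row["group"], []).append(float(row["value"]))
    return groups


def draw_boxplot(groups):
    """Draw one box per group and return the figure and axes."""
    fig, ax = plt.subplots(figsize=(4, 3))
    names = list(groups)
    # The plot itself is a single call...
    ax.boxplot([groups[name] for name in names])
    # ...most of the remaining lines are the "small things": labels and styling.
    ax.set_xticks(range(1, len(names) + 1))
    ax.set_xticklabels(names)
    ax.set_ylabel("value (arbitrary units)")
    ax.set_title("Hypothetical measurements")
    ax.spines["top"].set_visible(False)
    ax.spines["right"].set_visible(False)
    fig.tight_layout()
    return fig, ax


groups = load_table(TABLE)
fig, ax = draw_boxplot(groups)
```

Calling `fig.savefig("boxplot.png", dpi=150)` afterwards would write the figure to disk. Note that the plot needs one line, while the details take most of the function.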
  • 01:03:57Here you see maybe a future perspective,
  • 01:04:01which could be the online solutions.
  • 01:04:04They're already available. With some,
  • 01:04:06for example, you can produce UpSet plots,
  • 01:04:09or a raincloud plot,
  • 01:04:10or some other exotic types
  • 01:04:13of data visualization online.
  • 01:04:15So there are websites,
  • 01:04:17web servers where you can insert
  • 01:04:20your data as tables and they produce
  • 01:04:23the plot that you want, and you have
  • 01:04:27some sort of interactivity,
  • 01:04:28so that could be the future.
  • 01:04:32Web servers provide you with
  • 01:04:34the main programming environment.
  • 01:04:36You just need to insert your data
  • 01:04:38and you can see interactively,
  • 01:04:44by the interaction with the web server,
  • 01:04:46how to customize the plot.
  • 01:04:48Most of the solutions right now are
  • 01:04:53commercial, and so you need to pay,
  • 01:04:56and that's the drawback of this.
  • 01:04:58But it could probably be the
  • 01:05:00future, matching programming
  • 01:05:02with ease of use.
  • 01:05:06This is a useful resource that
  • 01:05:08you can also use to decide which
  • 01:05:11kind of plot you want.
  • 01:05:13So there are a lot of these trees so that,
  • 01:05:16depending on what you want to represent,
  • 01:05:19one numeric variable, two numeric
  • 01:05:21variables or categorical variables,
  • 01:05:22you can follow the tree and arrive
  • 01:05:24at the best graphical
  • 01:05:26solution to display your data.
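The idea of following a tree from the type of variables to a kind of plot can be sketched as a small lookup function. This is only an illustration: the function name `suggest_plots` and the specific branches are assumptions based on common practice, not the content of the resource shown in the talk.

```python
def suggest_plots(n_numeric, n_categorical):
    """Return candidate plot types for the given numbers of variables.

    The branches are an illustrative assumption, not the tree from the
    resource mentioned in the talk.
    """
    if n_numeric < 0 or n_categorical < 0:
        raise ValueError("variable counts must be non-negative")
    if n_numeric == 1 and n_categorical == 0:
        return ["histogram", "density plot"]
    if n_numeric == 2 and n_categorical == 0:
        return ["scatter plot", "2D density plot"]
    if n_numeric == 1 and n_categorical >= 1:
        return ["box plot", "violin plot"]
    if n_numeric == 0 and n_categorical >= 1:
        return ["bar plot"]
    return []  # combination not covered by this small sketch


for counts in [(1, 0), (2, 0), (1, 1), (0, 1)]:
    print(counts, "->", suggest_plots(*counts))
```

A real decision tree would have more branches (ordered data, many variables, maps, networks), but the principle is the same: first classify the variables, then pick among the few plots that suit them.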
  • 01:05:28So I suggest you visit it,
  • 01:05:30also to look at what
  • 01:05:33kinds of possibilities for data
  • 01:05:35representation you have. Online
  • 01:05:37there are many of these sites
  • 01:05:39now, and that's why it's easy to
  • 01:05:42look at the documentation and also
  • 01:05:44to retrieve and reproduce the code.
  • 01:05:46This is another example. I close with this
  • 01:05:52citation, which I find particularly
  • 01:05:54related to data visualization:
  • 01:05:57science is not nature itself, but
  • 01:06:00nature under our observation.
  • 01:06:03And so the science of data visualization
  • 01:06:05is a way to allow more adherence between
  • 01:06:10observation, science and visualization.
  • 01:06:17Thank you come on.