Fantastic plots and how to draw them

October 29, 2020

Toma Tebaldi
Associate Research Scientist, Section of Hematology, Yale Cancer Center, Yale School of Medicine
YCCEH Seminar
October 15, 2020

ID
5825

Transcript

  • 00:00The topic of today's talk
  • 00:04is data visualization.
  • 00:05It will be a very general talk
  • 00:09on how to design some strategies,
  • 00:12some issues, and some principles to guide
  • 00:15the visualization of our data.
  • 00:20So data visualization is important to
  • 00:22explore the data and this is particularly
  • 00:25crucial since nowadays data are becoming
  • 00:28much more complex and much bigger,
  • 00:31and so in general there
  • 00:33is a rise of data science.
  • 00:35So not only in research,
  • 00:37not only in biological research.
  • 00:40The second function of data
  • 00:43visualization
  • 00:45is to communicate the data, and
  • 00:47that may be the most traditional one.
  • 00:54Creating publication figures,
  • 00:55for example,
  • 00:56is about communicating data to others,
  • 01:01because communicating data visually is
  • 01:04more efficient than words in general.
  • 01:09So in order to represent complex data, here
  • 01:13I collected three
  • 01:16general challenges and aims.
  • 01:18Whenever you plot data, it is
  • 01:22important that the plots and
  • 01:24the representations are precise,
  • 01:27so that they are truthful.
  • 01:28That means that distortion has
  • 01:31to be avoided as much as possible. This
  • 01:34is not always achievable,
  • 01:36so distortion is sometimes unavoidable.
  • 01:39Think, for example,
  • 01:41of when you plot 2D maps
  • 01:44to represent 3D data.
  • 01:46But the point is that the
  • 01:48distortion must not convey
  • 01:49the message of the figure,
  • 01:51so it has to be something that is not
  • 01:54related to the main message of the figure.
  • 01:57Otherwise it's a problem.
  • 01:58Then the second point is clarity.
  • 02:01So the figure must not be ambiguous,
  • 02:06and the third one is efficiency.
  • 02:09So every
  • 02:11bit of ink, every pixel is precious,
  • 02:14so each decision in making your plot,
  • 02:17each decision on the color, on the size,
  • 02:20on the number of layers
  • 02:24that you plot is important.
  • 02:28Everything has to have a purpose,
  • 02:32so you should reduce the
  • 02:34so-called chartjunk here.
  • 02:36Below the slide you see a
  • 02:38quotation from Edward
  • 02:39Tufte. By the way, I discovered
  • 02:43only yesterday, although
  • 02:45I have been here
  • 02:48for three years and
  • 02:50never knew it, that Edward Tufte,
  • 02:52who is one of the
  • 02:56most celebrated visualization
  • 02:58experts, is there in New Haven.
  • 03:03And the quotation is that with an
  • 03:05image you have to give the viewer
  • 03:07the greatest number of ideas in the
  • 03:09shortest time and with the least possible
  • 03:11ink.
  • 03:13This is another general representation
  • 03:15of what you should consider
  • 03:18to make a good visualization.
  • 03:20Also, this is very general, so
  • 03:23it's not only about science,
  • 03:25and basically the criteria are
  • 03:27organized in four different sets.
  • 03:29You need to represent the information,
  • 03:32but the figure also needs
  • 03:34to convey, to communicate a story, and
  • 03:37that's the concept of the figure.
  • 03:40This is also connected with the
  • 03:43goal of the figure, so the function,
  • 03:47that is, what is the message that you
  • 03:50want to display. And also the visual
  • 03:53form is important.
  • 03:55Obviously the weight of these four
  • 03:58different layers is different in
  • 04:00different applications of images,
  • 04:02so visual form is probably more
  • 04:05important for artistic displays,
  • 04:07while for scientific displays
  • 04:09information, goal and
  • 04:12story are probably more important.
  • 04:14This doesn't mean that you should not
  • 04:17also consider the visual part.
  • 04:20Ideally, the perfect visualization
  • 04:22is at the center of these four sets.
  • 04:28So this was the introduction.
  • 04:31The rest of the presentation will be
  • 04:33structured with some very concrete examples,
  • 04:35and it's also organized in
  • 04:37a way that is interactive,
  • 04:39so I will show
  • 04:41some examples of figures and I will
  • 04:44ask you what could be wrong with each figure.
  • 04:48Starting with this one:
  • 04:49this is a figure that is very
  • 04:52frequent in scientific publications.
  • 04:55It's a bar plot, and
  • 04:58it's actually the most frequent image
  • 05:01that you can find in biomedical journals.
  • 05:08Do you have any ideas of what
  • 05:11could be wrong with this?
  • 05:13Not pretty. Yes.
  • 05:18It's lacking data.
  • 05:21It's not showing you
  • 05:23the data distribution.
  • 05:24Yes, exactly.
  • 05:26It's putting the treatment on the left,
  • 05:29which I never like.
  • 05:31I always want the control on the
  • 05:33left. Oh yes, OK.
  • 05:35Yes, that's true,
  • 05:36so it has a lot of
  • 05:42visual problems.
  • 05:44But the main thing is that it doesn't
  • 05:47show the data.
  • 05:49That's the main drawback of this image,
  • 05:52and, particularly in the last years,
  • 05:54that's something that a lot of
  • 05:57publishers are requesting in images.
  • 05:59So the principle is that ideally you need to
  • 06:02always show the data points in every figure,
  • 06:05because you should show the data that
  • 06:08make up your figures, and this for a
  • 06:11bar plot means that you have to show
  • 06:13the data points.
  • 06:16So here you see an example of how this
  • 06:19bar plot can be represented with the
  • 06:22data points. You see here
  • 06:25on the right the single
  • 06:28data points, and you also see a summary
  • 06:30statistic that could be, for example,
  • 06:33the mean plus or minus the standard deviation
  • 06:35for the treatment and the control.
  • 06:38In general, showing bar plots
  • 06:40with only the mean and the standard
  • 06:42deviation is a problem, and there was
  • 06:45a publication five years ago
  • 06:47that showed why. One issue is,
  • 06:49for example,
  • 06:50that different data distributions
  • 06:52can lead to the same bar plot.
  • 06:55You see an example here.
  • 06:57So in A
  • 06:59you see a bar plot representation
  • 07:01of a distribution of data, and all the
  • 07:04distributions that you see from B to E
  • 07:06could be represented
  • 07:08by that bar plot.
  • 07:09So the ideal situation would be
  • 07:11what you see here in plot
  • 07:14B, where you have data that are
  • 07:18symmetrically distributed.
  • 07:19So if the distribution
  • 07:21of your real data is
  • 07:23like this, the bar plot is less problematic.
  • 07:26But, for example, in C you see a situation
  • 07:28where you have an outlier, and so
  • 07:31this would mean that the
  • 07:33supposed difference that you are
  • 07:35showing in the bar plot is not real,
  • 07:38but is present only because
  • 07:39you have this outlier point,
  • 07:41while most of the other data are
  • 07:44overlapping in the two distributions.
  • 07:46Sometimes, as you see in D,
  • 07:49this plot could hide some patterns in
  • 07:52the data, so that's what you see here
  • 07:55in D. Do you see my cursor?
  • 07:59Yes, OK,
  • 08:00so this for example shows that
  • 08:05the distributions that you see
  • 08:07here are bimodal.
  • 08:08This could be linked, for example,
  • 08:11to replicates, for example
  • 08:12technical replicates and
  • 08:13biological replicates,
  • 08:14or it could be an important
  • 08:16property of the data.
  • 08:18Nevertheless,
  • 08:18it's something that you cannot
  • 08:20see if you represent the data with
  • 08:23a bar plot. Bar plots also
  • 08:25hide the number of data points that are
  • 08:27used to build the plot,
  • 08:29the bars themselves, and so
  • 08:31for example in E you see a situation
  • 08:33where you have an unequal number of
  • 08:35points for the black and the white
  • 08:39bar on the left and the right.
  • 08:42This is also a problem when you
  • 08:45want to show paired data in bar plots.
  • 08:49So again here you see a situation
  • 08:53where a bar plot
  • 08:54is the same
  • 08:57for the situations that you see
  • 08:59displayed in B, C and D, and B, C and D are
  • 09:01very different situations. Here
  • 09:03you could imagine, for example,
  • 09:04that these data are obtained from single
  • 09:06patients treated with a drug,
  • 09:09and you measure a parameter of the patients,
  • 09:11and so the information related to each
  • 09:14patient has to be connected. That's the
  • 09:17meaning of the paired plot.
  • 09:20So the situation in B shows that the
  • 09:22drug has a consistent effect on all
  • 09:25the patients, and you can see that
  • 09:28calculating, for each patient,
  • 09:30the difference between the dots on
  • 09:32the left and on the right gives rise
  • 09:35to this plot here below,
  • 09:38where all the differences are
  • 09:40positive and are also consistent. In
  • 09:42C you see a situation where
  • 09:44the drug has
  • 09:46very different effects depending
  • 09:48on the patient,
  • 09:50so that the distribution of
  • 09:52the differences is skewed.
  • 09:53And by the way,
  • 09:55this line represents the median
  • 09:57difference that you see across
  • 09:59patients for the treatment.
  • 10:01And the third plot, D, that you see
  • 10:03has a composition of effects.
  • 10:05So here you see that again
  • 10:08the difference is bimodal.
  • 10:09That means that there are patients
  • 10:11that do not respond to the drug,
  • 10:14which you see here with the
  • 10:15horizontal lines, and some patients
  • 10:17that respond to the drug.
  • 10:19So the resulting distribution of the
  • 10:21differences, as you see here, is bimodal.
  • 10:24The problem with bar plots,
  • 10:25and the problem
  • 10:26also
  • 10:27if you use bar plots with paired
  • 10:29data, is that you don't see
  • 10:31any of this structure, so
  • 10:33you are losing it.
  • 10:34So the best way is always to show
  • 10:36the dots of your distribution,
  • 10:38maybe together with the bar plots,
  • 10:40and, if the data are paired, also
  • 10:42to show the single connections
  • 10:44between the dots.
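
A minimal Python sketch of the paired-data point above, using made-up numbers for six hypothetical patients. Two treatments give the same group mean, so the bar heights would be identical, yet the per-patient differences show a consistent effect in one case and a bimodal (responder versus non-responder) pattern in the other. The function name and all values are illustrative assumptions, not data from the talk.

```python
from statistics import mean

def paired_differences(before, after):
    """Return after - before for each patient, keeping the pairing."""
    if len(before) != len(after):
        raise ValueError("paired data need one 'after' value per 'before' value")
    return [a - b for b, a in zip(before, after)]

# Hypothetical measurements for six patients (made-up numbers).
before = [10.0, 12.0, 11.0, 13.0, 9.0, 11.0]
consistent = [13.0, 15.0, 14.0, 16.0, 12.0, 14.0]  # every patient gains 3
mixed = [10.0, 12.0, 11.0, 19.0, 15.0, 17.0]       # three gain 0, three gain 6

diff_consistent = paired_differences(before, consistent)
diff_mixed = paired_differences(before, mixed)

# The group means (the bar heights) are the same in both scenarios...
print("bar heights:", mean(consistent), mean(mixed))
# ...but the per-patient differences tell two different stories.
print("consistent effect:", diff_consistent)
print("bimodal effect:   ", diff_mixed)
```

Plotting the two lists of differences, or the lines connecting each pair, is what makes the structure visible.
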
  • 10:49There is also an issue about the
  • 10:52choice of displaying the mean of your data,
  • 10:56for example, versus the median,
  • 10:58or of showing the standard deviation
  • 11:01versus the standard error of the mean.
  • 11:04Mean and median are ways to represent
  • 11:07a summary of the centrality of a distribution.
  • 11:11The mean is preferable if you
  • 11:14suppose your data are, for example,
  • 11:17symmetrically distributed,
  • 11:18for example if you assume that the data
  • 11:21have a normal or Gaussian distribution,
  • 11:23while the median
  • 11:25is the point that represents
  • 11:28the middle of your data,
  • 11:30the middle of your distribution,
  • 11:32and it's more generally applicable,
  • 11:34independently of the shape
  • 11:36of the distribution of the data.
  • 11:38So here you see an example where you
  • 11:41have four different sample populations
  • 11:44and you plot the mean plus the standard deviation.
  • 11:48That's the most conventional
  • 11:50way that you see in publications:
  • 11:52the mean plus standard deviation.
  • 11:55And for the median of the populations, here
  • 11:58you see the single points, and the
  • 12:00horizontal bar represents the median.
  • 12:02So an important point about
  • 12:06mean versus median is that the
  • 12:08mean should be used only with
  • 12:11symmetrical distributions,
  • 12:13otherwise it can be misleading,
  • 12:15while the median is more
  • 12:17generally appropriate.
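
As a small illustration of the mean-versus-median point, the following Python sketch adds one outlier to a symmetric group and compares how far each summary moves. The numbers are invented for the example.

```python
from statistics import mean, median

# Hypothetical, roughly symmetric measurements (made-up numbers).
group = [48, 51, 50, 49, 52]
# The same group with one extreme value added.
with_outlier = group + [250]

summary = {
    "mean": (mean(group), mean(with_outlier)),
    "median": (median(group), median(with_outlier)),
}
for name, (clean, contaminated) in summary.items():
    print(f"{name}: {clean} without outlier, {contaminated:.1f} with outlier")
```

The mean is pulled toward the outlier in proportion to its magnitude, while the median only shifts to the midpoint of the two central values.
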
  • 12:19When you have an outlier like that,
  • 12:21would you always recommend the median?
  • 12:24Sorry? When you have an outlier
  • 12:27like in the third group there,
  • 12:29then it makes more sense to
  • 12:31use the median?
  • 12:33Yeah, the point
  • 12:34is that the median is more robust
  • 12:37with outliers, it is much
  • 12:38more robust with outliers,
  • 12:40and the mean is not,
  • 12:42so the presence of an outlier, as
  • 12:44you see here in C, can shift the
  • 12:48mean a lot, while the median is affected,
  • 12:50but not so much,
  • 12:53especially by the magnitude
  • 12:55of the outlier,
  • 12:56I would say.
  • 12:58Toma, a question, right?
  • 12:59So, you know,
  • 13:02knowing there's a difference
  • 13:03between mean and median,
  • 13:05one of the things I heard
  • 13:07(of course, I haven't looked
  • 13:09into this deeply enough myself)
  • 13:11is that for the median the distribution,
  • 13:13unlike for the mean, does not necessarily follow
  • 13:15a Gaussian or normal distribution,
  • 13:17so that from a statistical point of view
  • 13:20it is going to be a little hard to calculate
  • 13:23certain significance etc.
  • 13:24based on median data.
  • 13:26Is that true?
  • 13:27Or is it simply a misnomer?
  • 13:30How to calculate the
  • 13:32significance of differences?
  • 13:34That's a different
  • 13:36issue; it depends on the approach.
  • 13:39If you choose parametric tests,
  • 13:41such as the t-test or the ANOVA,
  • 13:44those tests assume that the
  • 13:47distribution is Gaussian, is normal.
  • 13:49Yeah, so you need to be careful.
  • 13:52Usually, if you have repeated measures,
  • 13:55so if you're testing a repeated
  • 13:56measure, you assume the error
  • 13:59is distributed in a Gaussian way,
  • 14:01but that is not always the case,
  • 14:03for example
  • 14:04if you're comparing two populations of
  • 14:07genes with a signal for each gene.
  • 14:09You just have to check it.
  • 14:11Well, so this is something that I
  • 14:13think will be particularly important
  • 14:15for experimental scientists, right?
  • 14:17Because, you know, as experimentalists,
  • 14:18when we are trained, we know, OK,
  • 14:21when we design an experiment,
  • 14:22we do several replicates so we can
  • 14:24draw error bars, without thinking
  • 14:26why and how to deal with it.
  • 14:28And if you go to a statistician,
  • 14:30they will tell you, oh look,
  • 14:32if you're going to use the t-test,
  • 14:35you have to show me first that this is
  • 14:37actually largely a normal distribution
  • 14:39before you can actually use the t-test.
  • 14:42Whereas for the vast majority
  • 14:43of people in the lab,
  • 14:45that's not how they will
  • 14:46think about it in the first place,
  • 14:49and they are also not trained enough
  • 14:51to think, you know, how to prove
  • 14:53or disprove that that's the case.
  • 14:55So what would you suggest,
  • 14:57especially when we're doing
  • 14:58experiments where you cannot do
  • 15:00200 replicates for each experiment?
  • 15:02What would be a good
  • 15:04approach in that regard?
  • 15:06Yeah, so there is a tradeoff between
  • 15:09the ideal situation, where the ideal
  • 15:11situation would be always to have
  • 15:14enough data points so that you can
  • 15:16understand the shape of the distribution,
  • 15:19and the real-case scenario, where you
  • 15:22can only do as many replicates as you can,
  • 15:24and so usually you have to assume
  • 15:27that the distribution is normal.
  • 15:33Ideally, you should always check it.
  • 15:36And again, if you are repeating measures
  • 15:39and you are collecting a measure
  • 15:41of the same data in a repeated
  • 15:44way, you can assume that,
  • 15:46if the error is stochastic,
  • 15:48it should be normally distributed.
  • 15:50So you assume that the distribution of
  • 15:52the error is Gaussian, and it makes sense.
  • 15:55But in other situations,
  • 15:57where you have a lot of measurements,
  • 16:00and measurements of different entities,
  • 16:02for example the expression of
  • 16:04different genes, where you have
  • 16:07a population of genes,
  • 16:08then this assumption is less probable,
  • 16:11is less likely,
  • 16:12and you should have enough data points
  • 16:15so that you can switch from parametric
  • 16:18tests to non-parametric ones, that is,
  • 16:23tests that don't make assumptions
  • 16:25on the underlying distribution.
  • 16:26These are, for example,
  • 16:27the Wilcoxon test or the
  • 16:30Mann-Whitney test.
  • 16:31And the problem is that you need more
  • 16:34replicates, because if n,
  • 16:36the sample size, is less than five,
  • 16:39you cannot reach statistical
  • 16:43significance as it is accepted, below 0.05,
  • 16:47but it's usually the more correct way.
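
The sample-size limit mentioned above can be illustrated with a short calculation. Assuming exact tests without ties, the most extreme possible ranking gives the smallest achievable two-sided p-value: 2 / C(n1 + n2, n1) for the Mann-Whitney test and 2 / 2^n for the Wilcoxon signed-rank test on n pairs. The helper names below are hypothetical, and the exact cutoff depends on the test and on the design.

```python
from math import comb

def min_p_mann_whitney(n1, n2):
    """Smallest two-sided exact p-value of the Mann-Whitney test (no ties)."""
    # Only one of the C(n1 + n2, n1) rank arrangements is the most extreme
    # in each direction, hence the factor 2 for a two-sided test.
    return min(1.0, 2 / comb(n1 + n2, n1))

def min_p_signed_rank(n_pairs):
    """Smallest two-sided exact p-value of the Wilcoxon signed-rank test
    (no ties, no zero differences)."""
    # Each of the 2**n_pairs sign patterns is equally likely under the null.
    return min(1.0, 2 / 2 ** n_pairs)

for n in range(3, 7):
    print(
        f"n = {n}: Mann-Whitney (n vs n) >= {min_p_mann_whitney(n, n):.4f}, "
        f"signed-rank (n pairs) >= {min_p_signed_rank(n):.4f}"
    )
```

Under these assumptions, three replicates per group can never give p below 0.1 in the unpaired test, and a paired design needs at least six pairs to get below 0.05, no matter how large the effect is.
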
  • 16:53Then, the standard is not to use that,
  • 16:55and I remember there was a case
  • 16:57where a paper was in review.
  • 16:59It was from
  • 17:02Young bean, and I remember we performed
  • 17:05the Wilcoxon test and the reviewers
  • 17:07asked why we didn't do the parametric
  • 17:10test, so they asked for the opposite.
  • 17:13They asked us to go against the
  • 17:15ideal situation.
  • 17:18I think this is very helpful.
  • 17:20I think it's really,
  • 17:21you know, telling,
  • 17:22especially for people who
  • 17:24are not familiar with the
  • 17:26t-test and also the Wilcoxon test.
  • 17:29I would really suggest
  • 17:30you look into that.
  • 17:32These things can be very helpful.
  • 17:33Yeah, and obviously
  • 17:35it's important when you plan,
  • 17:37if you can, to have enough data
  • 17:40points to perform a nonparametric test.
  • 17:43In the high-throughput
  • 17:44experiments that we see now,
  • 17:45for example single cell, that's
  • 17:47not a problem anymore, because
  • 17:49you usually have a lot of data
  • 17:51points, and so that's less of a
  • 17:53problem that sometimes we work.
  • 17:55Is it after him because they thought
  • 17:58the number of data are increasing?
  • 18:00Just
  • 18:01a generic comment, Toma.
  • 18:04A lot of our group is blood,
  • 18:10hematology researchers,
  • 18:11and neither Blood nor Blood Advances requires
  • 18:16the investigators in their papers
  • 18:17to show all the
  • 18:19data points. And now
  • 18:21I'm on the publication committee.
  • 18:22We've actually talked about this,
  • 18:24but we go by the Journal of Cell Biology
  • 18:28instructions to authors and prep for figures,
  • 18:30and that is a Rockefeller Press
  • 18:32publication. And they
  • 18:34haven't. So they have genome research
  • 18:36and germ cell bio Med and stuff,
  • 18:38so they haven't come around to making
  • 18:41people show all their dots etc.
  • 18:43But a number of journals, as you know,
  • 18:46have, like JCI, you know, JC AI
  • 18:50advances etc. They might not
  • 18:51even review your paper if
  • 18:53you show, for instance,
  • 18:54your plots on the left here.
  • 18:56Well, it might, you know,
  • 18:57not even go out
  • 18:59for review. The pre-reviewers will say,
  • 19:01you know, your figures are inadequate
  • 19:03for our instructions to authors, etc.
  • 19:05So I think some journals are
  • 19:07coming around to "this is the way
  • 19:09we really want to see the data".
  • 19:12Yeah, I think there is a shift
  • 19:15in the paradigm, let's say,
  • 19:16and it will take years.
  • 19:18But, for example, I have a slide here,
  • 19:21and this is from my experience:
  • 19:24for example,
  • 19:25the whole family of the Nature journals
  • 19:28already has these policies for figures.
  • 19:31So this is something I received after
  • 19:34the review of a paper, as editorial
  • 19:38guidelines, and because of these
  • 19:41policies I had to change a
  • 19:44lot of figures.
  • 19:47One of the
  • 19:49policies, as you see here,
  • 19:50the last one, is that for sample
  • 19:52sizes that are less than 10
  • 19:54they want you to plot
  • 19:57the individual data points, and
  • 19:58so they don't accept bar graphs
  • 20:00anymore.
  • 20:03And then, for example,
  • 20:05if you have some statistics such as
  • 20:08error bars with fewer than 3 replicates,
  • 20:11you have to
  • 20:13remove them and you have
  • 20:15to show the data without the
  • 20:18statistics, without the error bars.
  • 20:20Then this also is a point that
  • 20:23usually is not satisfied:
  • 20:25when you plot some statistical
  • 20:28significance values,
  • 20:29they don't accept
  • 20:31the stars anymore,
  • 20:33but you have to provide the
  • 20:35precise P value in the figure.
  • 20:38It means that if you have some stars,
  • 20:40you have to change them,
  • 20:42converting the stars to the precise P
  • 20:45value before publishing. And
  • 20:47then you also have to provide the
  • 20:49precise sample size for each of your bars.
  • 20:52For example,
  • 20:53I mean,
  • 20:53I think in the past it was enough
  • 20:56to provide a range, like from
  • 20:58three to six replicates,
  • 21:00but now they really want the number
  • 21:04for each population,
  • 21:05for each sample that you have.
  • 21:08So these,
  • 21:10in my experience, were things
  • 21:12that I had to provide,
  • 21:15but after the review,
  • 21:18so it was at the editorial
  • 21:22stage of acceptance of the paper,
  • 21:24and I think this is true now for all
  • 21:28the family of the Nature
  • 21:32journals.
  • 21:34Can I add something?
  • 21:36Although this is only for
  • 21:38publication, for the goal of publication,
  • 21:39it's important that we start
  • 21:42practicing all these rules in
  • 21:44our daily life, because it's so
  • 21:46painful to have to do this
  • 21:49when you're trying to get
  • 21:51the figures into the journal.
  • 21:53It's a lot easier to do it while you're
  • 21:56making the figures in real life.
  • 21:59Yeah, so obviously it
  • 22:01saves work to do it before.
  • 22:02Yeah, it saves work, because otherwise
  • 22:05you have to redo all the figures.
  • 22:10Yeah, I also echo that,
  • 22:11and I also just want to say that, you know,
  • 22:14I used to just use Excel to plot things.
  • 22:17But since many of my lab members
  • 22:19started to use GraphPad Prism to plot,
  • 22:22that makes a huge difference in
  • 22:24converting between different types
  • 22:25of plots, such as this kind of thing.
  • 22:28If you have a bar graph
  • 22:30in that software,
  • 22:31then you can very easily change that to a
  • 22:34bar graph with the individual dots shown.
  • 22:36So it's very easy to work with.
  • 22:40Yeah, on that I also have something
  • 22:42at the end of the presentation.
  • 22:44So basically there are a lot of tools now,
  • 22:46more or less commercial,
  • 22:49that are readily available,
  • 22:51where you can switch to many different formats
  • 22:55starting with the same initial data,
  • 22:58basically formatted as a table,
  • 23:01so that from the same table you can switch
  • 23:03to many different visualizations.
  • 23:06So that's true,
  • 23:08and it's probably also easier now to make these
  • 23:11plots with single dots than it was in the past.
  • 23:15Without respect.
  • 23:18OK, so that was the main point of this part.
  • 23:22I had a part on the standard
  • 23:24deviation versus the standard error.
  • 23:26That's another issue, because the
  • 23:28standard error is basically the
  • 23:29standard deviation divided by the square
  • 23:32root of the number of experiments,
  • 23:34and usually the standard
  • 23:36error is displayed.
  • 23:37But you just have to be careful, because it's
  • 23:40a measure that tends to go to zero
  • 23:43just because you increase the number
  • 23:45of replicates or the number of points.
  • 23:48So you see an example here where
  • 23:50it seems, by plotting the standard
  • 23:52error, that the black bar and the
  • 23:55white bar have the same measure
  • 23:58of spread of the data.
  • 23:59But if you look at the standard
  • 24:02deviation, you see that this is
  • 24:04an effect of the fact
  • 24:06that the black bar has a higher spread,
  • 24:08but also more points,
  • 24:10and that's why the standard
  • 24:12error seems the same.
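
The relation described above, standard error = standard deviation / sqrt(n), can be checked with a short Python sketch. The two groups are invented so that the "black" group has twice the spread and four times as many points as the "white" group; the names and values are assumptions made for illustration.

```python
from math import sqrt
from statistics import stdev

def sem(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return stdev(values) / sqrt(len(values))

# Hypothetical groups (made-up numbers), both with mean 10.
white = [9, 11, 9, 11]                                 # n = 4, small spread
black = [7] * 4 + [13] * 4 + [8, 12] + [10] * 6        # n = 16, larger spread

for name, values in (("white", white), ("black", black)):
    print(f"{name}: n = {len(values)}, SD = {stdev(values):.3f}, SEM = {sem(values):.3f}")
```

Both groups end up with the same standard error, although the standard deviation of the black group is twice as large, which is the effect described for the two bars.
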
  • 24:16So that's another issue.
  • 24:18Obviously, for publication the
  • 24:20standard error of the mean is preferred,
  • 24:23because it usually gives an impression
  • 24:26of the data being less dispersed.
  • 24:30But especially with different
  • 24:31numbers of samples
  • 24:33in different bars,
  • 24:35this could be misleading.
  • 24:40And all these issues were presented
  • 24:42in this paper, published
  • 24:44five years ago in PLOS Biology.
  • 24:49I will skip this,
  • 24:50since we will touch on it later,
  • 24:52but an alternative solution, if
  • 24:54you have enough data points,
  • 24:56so I would say more than 10,
  • 24:59an alternative solution to
  • 25:01showing bar plots
  • 25:03to show the distribution of
  • 25:05the data is the box-and-whisker plot,
  • 25:07as you see here. I have some slides
  • 25:11later with more details on this.
  • 25:14OK, so this is the next example.
  • 25:17It's a pie chart with the
  • 25:19usage of different browsers,
  • 25:22so this is an image from outside science.
  • 25:26This is a classic example
  • 25:29in visualization lessons.
  • 25:31So what could be wrong with this?
  • 25:39There's no n. Yeah, so that's a yes,
  • 25:44there is no n, so
  • 25:47you don't know
  • 25:50how many data points were used in
  • 25:52order to build the frequencies.
  • 25:55Obviously pie charts are used to display
  • 25:59frequencies and proportions of some
  • 26:02classes that sum up to 100 or to one.
  • 26:05The main problem is that,
  • 26:07so the idea is that
  • 26:10you should avoid pie charts.
  • 26:12Displaying
  • 26:15information on a proportion or
  • 26:17a percentage as a pie chart is
  • 26:21not the best choice,
  • 26:23because it was shown that humans
  • 26:27are very bad at reading angles,
  • 26:30so we're not very
  • 26:32precise in understanding differences between
  • 26:35angles, and so between the sizes of the
  • 26:39slices of the pie. So usually, if you
  • 26:44convert the pie chart into a bar plot,
  • 26:47the information is much clearer.
  • 26:49It's true that the pie
  • 26:51chart is more aesthetically
  • 26:53appealing, but the bar plot
  • 26:55is, in any circumstance,
  • 26:57usually more effective in displaying,
  • 26:59for example, differences in the
  • 27:02usage of these genome browsers.
  • 27:04So this has been a long-standing issue, and
  • 27:07in many presentations there is always
  • 27:10this suggestion to avoid pie charts altogether.
  • 27:13There are also some examples of this.
  • 27:15So these are three pie charts, and you can see
  • 27:19that they are different from each other,
  • 27:22but it's very difficult to
  • 27:24understand the difference.
  • 27:25The difference is in the size
  • 27:28of the slices of the three pies,
  • 27:31but it's very difficult,
  • 27:32for example, to understand in each
  • 27:35pie which one is the largest slice
  • 27:38and to draw comparisons. It is much
  • 27:41easier to understand these issues,
  • 27:44so which slice is larger, if the information is
  • 27:48not displayed as pie charts
  • 27:51but as bar charts.
  • 27:54So that's on the web.
  • 27:56I also found these provocative.
  • 27:59Label of pie charts as lighters.
  • 28:03So in general it would be better to avoid
  • 28:06displaying information as a pie chart
  • 28:08and prefer a bar chart instead
  • 28:10to show the same information.
  • 28:14OK, so that was faster.
  • 28:16This is another example.
  • 28:18What could be wrong with this plot?
  • 28:20Again, we have a treatment.
  • 28:21We have a control.
  • 28:22This time we see the data points.
  • 28:26The scale is wrong, so it obscures
  • 28:29the distribution at the lower end.
  • 28:31Yes, exactly, so this is a case where
  • 28:34most of the data are compressed,
  • 28:37since they have very different magnitudes.
  • 28:39Most of the data are compressed
  • 28:42here in a very small part of the
  • 28:46plot, and we cannot really understand
  • 28:48how they are distributed,
  • 28:51because most of the plot is devoted
  • 28:53to these two outliers.
  • 28:56So this is an issue with measures
  • 28:58that have different magnitudes.
  • 29:00In my experience it happens,
  • 29:03for example, in gene expression measurements,
  • 29:07because they can vary,
  • 29:08especially with sequencing,
  • 29:10they can vary over four to five orders
  • 29:13of magnitude. The main way to solve this
  • 29:16issue is to log transform the data,
  • 29:19so instead of plotting on a
  • 29:21linear scale, to log transform
  • 29:24the scale of the data. This
  • 29:26allows you to reduce the distance
  • 29:28between these two points,
  • 29:30the outliers,
  • 29:31and allows you to also see the
  • 29:34distribution of the points
  • 29:36that here seem all compressed.
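
To make the compression effect concrete, here is a small Python sketch that measures what fraction of the axis is occupied by the bulk of the data before and after a log10 transform. The values, the cutoff and the helper name are assumptions chosen for illustration. All values must be positive; for data containing zeros, a small pseudocount is a common workaround.

```python
from math import log10

def fraction_of_axis(values, cutoff, transform=lambda x: x):
    """Fraction of the axis range (min to max) spanned by values <= cutoff."""
    transformed = sorted(transform(v) for v in values)
    bulk_top = max(transform(v) for v in values if v <= cutoff)
    return (bulk_top - transformed[0]) / (transformed[-1] - transformed[0])

# Hypothetical expression values: six ordinary points and two outliers.
expression = [1, 2, 5, 10, 20, 50, 10_000, 100_000]

linear = fraction_of_axis(expression, cutoff=50)
logged = fraction_of_axis(expression, cutoff=50, transform=log10)

print(f"linear scale: bulk of the data uses {linear:.2%} of the axis")
print(f"log10 scale:  bulk of the data uses {logged:.2%} of the axis")
```

On the linear scale the six ordinary points share a sliver of the axis, while on the log scale they occupy about a third of it.
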
  • 29:40So usually log transformation allows you to
  • 29:43capture some information on the differences
  • 29:45between your points more clearly,
  • 29:48not in all cases, but in some
  • 29:50cases, than displaying the
  • 29:52information on a linear scale,
  • 29:55especially when you have a wide range
  • 29:58between your minimal and maximal
  • 30:00measurements. An alternative way
  • 30:04is to use panel breaks.
  • 30:07Personally I prefer log
  • 30:10transformation over panel breaks, because
  • 30:13it is mathematically more
  • 30:16elegant,
  • 30:18but there are situations where
  • 30:20you can choose. So this is an example.
  • 30:24You have a bar chart.
  • 30:26You have a huge difference between
  • 30:28the measurements of A to D and E and F.
  • 30:31So this is how you solve the problem by
  • 30:35introducing a break in your panel,
  • 30:38from 25 to 200 to 210, and this is the
  • 30:41equivalent solution by log transformation.
  • 30:44As you see,
  • 30:46the two solutions
  • 30:49give a similar result.
  • 30:51But here you insert a manual break in
  • 30:54the data, and this could be misleading.
  • 30:57Here you solve the issue by log
  • 31:00transforming all the measurements.
  • 31:02So this is an advantage,
  • 31:04because it affects all the measurements,
  • 31:07while this panel break
  • 31:09affects only, for example,
  • 31:11these two bars and could distort the data.
  • 31:18Another scenario where you should
  • 31:22consider log transformation is this.
  • 31:26This could be a plot that shows gene
  • 31:30expression levels from RNA-seq for a
  • 31:33population of genes.
  • 31:35So each gene could be a dot,
  • 31:37and you see that
  • 31:39there is a different range,
  • 31:41but you see that there are outliers:
  • 31:44for example, genes of ribosomal proteins
  • 31:46and histones usually
  • 31:49are in this part of the plot,
  • 31:52but most of the genes, 90% of
  • 31:54your genes, are in this part of the
  • 31:56plot, and
  • 32:01you cannot really inspect them,
  • 32:03because most of the plot is
  • 32:05dedicated to some outliers.
  • 32:07So again, here is a situation
  • 32:09where you can log transform
  • 32:11both of the coordinates.
  • 32:12So let's say that here is the
  • 32:15control and this is the treatment,
  • 32:17and this will allow you to see in more
  • 32:20detail the differences in the
  • 32:22expression of the bulk of your genes.
  • 32:28In a situation like this,
  • 32:30you should also consider,
  • 32:32if you're interested,
  • 32:34for example, in showing differences in
  • 32:37expression between a control,
  • 32:39for example R1, and
  • 32:41the treatment, R2,
  • 32:43that you have also the possibility to show,
  • 32:48as the Y axis,
  • 32:50the differences in the log values.
  • 32:52So this representation
  • 32:53here is the same as this,
  • 32:56but it maximizes the visualization
  • 32:58of the differences in the
  • 33:00expression levels of genes.
  • 33:01So this is something that you
  • 33:04find called an MA plot.
  • 33:07It was introduced with the
  • 33:09analysis of microarray data,
  • 33:11but you can find it also
  • 33:13with sequencing data.
  • 33:15Sometimes these two different
  • 33:16visualizations are used
  • 33:17depending on the aim of the figure,
  • 33:20so sometimes you will find this one,
  • 33:21especially when the message of the
  • 33:23figure is that you don't see big
  • 33:25differences between the two conditions,
  • 33:27while if the message is that you find big
  • 33:29differences between the two conditions,
  • 33:31you will find mostly this visualization.
  • 33:35So here I would just point out that
  • 33:37in any sequencing experiment
  • 33:40you will probably never find any gene
  • 33:44that is in this area, because
  • 33:48for most of the genes
  • 33:50the main difference,
  • 33:51the main determinant,
  • 33:53is the basal expression level.
  • 33:55So usually your perturbations do not
  • 33:57affect so much the expression of a gene,
  • 34:00so the gene is in this
  • 34:03area of the plot, the orange one.
  • 34:06And that's why this visualization
  • 34:08is much more efficient in capturing
  • 34:10the expression differences,
  • 34:12because they scale on the
  • 34:15expression at baseline.
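
A minimal sketch of how the two coordinates of an MA plot can be computed from control and treatment values. It assumes the usual convention (M is the difference of the log2 values, A is their average); the `pseudocount` parameter and the example counts are assumptions added here so that zero counts can be log transformed, not something shown in the talk.

```python
import math

def ma_values(control, treatment, pseudocount=1.0):
    """Return (A, M) pairs: A = mean log2 expression, M = log2 difference."""
    pairs = []
    for c, t in zip(control, treatment):
        log_c = math.log2(c + pseudocount)
        log_t = math.log2(t + pseudocount)
        a = (log_c + log_t) / 2   # x axis: average log expression
        m = log_t - log_c         # y axis: difference of the log values
        pairs.append((a, m))
    return pairs

# Hypothetical expression values for five genes.
control = [0, 3, 15, 255, 4095]
treatment = [0, 7, 15, 127, 4095]
for a, m in ma_values(control, treatment):
    print(f"A={a:6.2f}  M={m:+.2f}")
```

Genes that do not change sit on the horizontal line M = 0 regardless of how strongly they are expressed, which is why the differences are easier to read than on the diagonal of a scatter plot.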
  • 34:19OK, so now I have a section, I don't
  • 34:21know, I have a section
  • 34:24about how to display distributions.
  • 34:29So let's say that
  • 34:29we have... You have 15
  • 34:32minutes, and if we go a little over, that's
  • 34:34OK. OK. So when you have to
  • 34:37represent the distribution of data,
  • 34:39you have many choices.
  • 34:40The histogram is one of the most used choices.
  • 34:44It has the advantage that it can present
  • 34:48in detail the shape of the
  • 34:51distribution of your data.
  • 34:53And so basically you have a variable of
  • 34:56interest that usually is a continuous
  • 34:58variable and you want to show
  • 35:01how this variable is distributed.
  • 35:03So you divide the range of the values into
  • 35:06some bins and then you count the number
  • 35:09of points that fall inside each bin.
  • 35:12The issue with histograms is
  • 35:14that you should be careful when
  • 35:17building histograms and when looking
  • 35:19at histograms, because there are
  • 35:22some arbitrary parameters
  • 35:24in building a histogram,
  • 35:26mainly the choice of the bin size.
  • 35:30So this is an example where the same
  • 35:33distribution of data, that is the
  • 35:35distribution of the price of Airbnb
  • 35:38apartments in a French city, has been
  • 35:40binned in two different ways.
  • 35:43So here is the price and here the bin size.
  • 35:47So the size of each bin is 10
  • 35:52dollars,
  • 35:53while here on the right it is
  • 35:57$2, so you can see that using more
  • 36:00granular bins allows you to see
  • 36:04the presence of some accumulations
  • 36:06in your data that you cannot really
  • 36:09see with the larger bin size,
  • 36:11and this could be important because
  • 36:14these accumulations probably
  • 36:16are accumulations of prices that are
  • 36:19due to the fact that they are prices
  • 36:22that are commonly used
  • 36:23by many different Airbnbs, for example,
  • 36:26because they are multiples of 50 or 100,
  • 36:29for example.
  • 36:30But the fact is that depending on
  • 36:33the choice of the bin,
  • 36:34you see a different story.
  • 36:38And so you should always be
  • 36:42careful to select a bin size
  • 36:45that doesn't distort the data too much.
  • 36:49There are also software tools
  • 36:51that calculate, depending on
  • 36:53your data, depending on where
  • 36:56your points are placed, the best
  • 36:58size of the bins, so that you
  • 37:02reduce the distortion of your data.
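
The effect of the bin size can be sketched with a few lines of Python. The prices below are made up for illustration (they are not the Airbnb data from the slide) and the `histogram` helper is a hypothetical name; the point is only that a pile-up at a round price is visible with narrow bins and absorbed by wide ones.

```python
from collections import Counter

def histogram(values, bin_width):
    """Count how many values fall in each bin [k*w, (k+1)*w)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical nightly prices with an accumulation at the round value 50.
prices = [41, 43, 44, 46, 47, 48, 50, 50, 50, 50, 50, 52, 53, 55, 57, 58]

coarse = histogram(prices, 10)   # wide bins: the spike at 50 is hidden
fine = histogram(prices, 2)      # narrow bins: the spike at 50 stands out
print("bin width 10:", coarse)
print("bin width  2:", fine)
```

Both histograms count exactly the same points; only the arbitrary choice of the bin width changes the story they appear to tell.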
  • 37:09An alternative way to represent a
  • 37:10distribution is to use a density plot.
  • 37:13So a density plot is basically
  • 37:15a smoothing of a histogram.
  • 37:18Here you collect bins, and here
  • 37:21you smooth the shape of the distribution
  • 37:23so that you have a continuous function.
  • 37:27This is graphically nice,
  • 37:30and it allows you to compare,
  • 37:31for example, the distributions of
  • 37:33two variables, as you see here
  • 37:35in green and in violet,
  • 37:37and the advantage is that you can see also
  • 37:40complex shapes of the distribution,
  • 37:42for example here the bimodality,
  • 37:44or here the presence of this
  • 37:46shoulder of the distribution.
  • 37:48The pitfall is similar to the histogram,
  • 37:51so you should always be careful
  • 37:53in selecting
  • 37:54how much you smooth the distribution.
  • 37:56And here you see an example.
  • 37:58So these are the
  • 37:59points, the single
  • 38:02points that were used
  • 38:04in order to build the distribution.
  • 38:07They were randomly chosen from a normal
  • 38:10distribution, and you can see that the
  • 38:12problem is similar to the bin size,
  • 38:14so here you have to select basically
  • 38:19a wavelength in order to approximate
  • 38:21the function to a curve,
  • 38:23and depending on the wavelength,
  • 38:26the resolution of the
  • 38:27wavelength that you choose,
  • 38:29the result is different.
  • 38:31So you could have this kind of plot
  • 38:34that seems to show a lot of local peaks,
  • 38:38but by smoothing more you have
  • 38:41instead the normal distribution
  • 38:43from which you drew the data. So
  • 38:46there is a balance:
  • 38:48choosing bins that are too
  • 38:51large, or here excessive smoothing,
  • 38:53oversimplifies
  • 38:54the original distribution,
  • 38:55but on the other side,
  • 38:57if you take a resolution that is too small,
  • 39:01too granular,
  • 39:02you can obtain strange effects.
  • 39:04So you could see, for example,
  • 39:06peaks that depend on the
  • 39:09extraction of random numbers.
  • 39:11Again,
  • 39:11also in this case there are software tools
  • 39:15that, given the original data,
  • 39:18your original raw data, can calculate the
  • 39:22optimal smoothing wavelength in order
  • 39:26to avoid distortions based on your data.
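
The smoothing trade-off can be sketched with a small Gaussian kernel density estimate. The smoothing parameter (called wavelength in the talk, more commonly bandwidth) is tried at three values: too small, an automatic choice, and too large. The automatic choice uses the normal-reference rule of thumb often attributed to Silverman; that rule, the sample size and the function names are assumptions made for this illustration, not details from the slides.

```python
import math
import random
import statistics

def gaussian_kde(sample, bandwidth):
    """Return a function estimating the density as a sum of Gaussian bumps."""
    norm = len(sample) * bandwidth * math.sqrt(2 * math.pi)
    def density(x):
        return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample) / norm
    return density

def count_peaks(density, lo, hi, steps=400):
    """Count local maxima of the density on an evenly spaced grid."""
    ys = [density(lo + (hi - lo) * i / steps) for i in range(steps + 1)]
    return sum(1 for i in range(1, steps) if ys[i - 1] < ys[i] > ys[i + 1])

# Points drawn from a single normal distribution, as in the example slide.
random.seed(0)
sample = [random.gauss(0, 1) for _ in range(30)]

# Rule-of-thumb bandwidth: 1.06 * standard deviation * n^(-1/5).
rule_of_thumb = 1.06 * statistics.stdev(sample) * len(sample) ** (-1 / 5)

for h in (0.05, rule_of_thumb, 3.0):
    kde = gaussian_kde(sample, h)
    print(f"bandwidth={h:.2f}  peaks={count_peaks(kde, -4, 4)}")
```

With the smallest bandwidth the curve shows many spurious local peaks that only reflect the random draw, while the largest one flattens everything into a single broad bump.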
  • 39:29A compact way to represent a
  • 39:31distribution is the box whisker plot,
  • 39:34and here you can see how a box
  • 39:36whisker plot can be obtained
  • 39:39from this distribution of 20 points.
  • 39:41So basically the box whisker plot
  • 39:43represents as a box 50% of the data
  • 39:46of the distribution.
  • 39:48Usually you have a central line
  • 39:50that is the median.
  • 39:51It's important:
  • 39:52not the mean,
  • 39:53in the box whisker it is always the median.
  • 39:56This is the first quartile
  • 39:58and the third quartile,
  • 40:00so the 25th percentile of the data and the
  • 40:0375th percentile of the data.
  • 40:05So in the box you have 50%
  • 40:07of your data, the central
  • 40:09part, here, 50% of your data.
  • 40:11Then you have the whiskers.
  • 40:14The standard definition of the
  • 40:17whisker length is that they are as
  • 40:20long as the interquartile range,
  • 40:22that's the distance between Q1 and Q3, times 1.5.
  • 40:27And you see these as the whiskers
  • 40:30of your plot.
  • 40:32So these collect most of the
  • 40:34distribution of your data.
  • 40:36The data that are outside the whiskers
  • 40:38are considered to be outliers.
  • 40:40For example,
  • 40:41here you see these three points.
  • 40:44They are outside the whisker size,
  • 40:46and so these usually are individually
  • 40:49displayed in the whisker plot and are
  • 40:52considered to be outliers according
  • 40:54to this definition of the whiskers.
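
The anatomy described above can be written down as a short calculation. This is a sketch under stated assumptions: the quartiles use the "inclusive" interpolation of Python's `statistics.quantiles` (other software may interpolate slightly differently), the whisker factor defaults to 1.5, and the data are invented for illustration.

```python
import statistics

def box_stats(data, whisker=1.5):
    """Median, quartiles, whisker ends and outliers for a box whisker plot."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - whisker * iqr, q3 + whisker * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "q1": q1, "median": median, "q3": q3,
        # Whiskers stop at the most extreme data point inside the fences,
        # so the two whiskers can have different lengths.
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 30]
stats = box_stats(data)
print(stats)
```

In this example the value 30 lies beyond Q3 + 1.5 × IQR and is reported as an outlier, while the lower whisker ends at the smallest observation because no data point reaches the lower fence.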
  • 40:57Yes,
  • 40:58if you wanted
  • 40:59to make these plots, yeah,
  • 41:00is there an easy way to do it,
  • 41:03or do you, like you personally,
  • 41:05just do it in R or something?
  • 41:08Well, box plots, I don't think
  • 41:10you can do them with Excel,
  • 41:13but for example with Prism
  • 41:15or Origin you can totally.
  • 41:20I think the only limitation is
  • 41:22Excel, but to be honest, I didn't
  • 41:24check the last version of Excel.
  • 41:27Right, for us to think about,
  • 41:28you know, we have our data and there
  • 41:31are many different ways of plotting it,
  • 41:33but it sounds like Prism might be the
  • 41:35way to go to try to do it, unless
  • 41:37you're somebody like you
  • 41:38who knows how to put it into R.
  • 41:41Yes, probably, so Prism
  • 41:45gives you an option that is much
  • 41:47used. I usually used
  • 41:48Origin; with respect to Prism
  • 41:50I think it has more,
  • 41:52more power,
  • 41:54so there are more things that you
  • 41:56can do with Origin than Prism,
  • 41:59I think because it was designed
  • 42:01for the physics community,
  • 42:04but the tradeoff is always complexity,
  • 42:06so Prism has less power,
  • 42:09fewer choices, but it's easier
  • 42:10to use than Origin,
  • 42:13but both share the same philosophy,
  • 42:15so that you need to provide the data
  • 42:18in a spreadsheet format, and they are
  • 42:21available in the software library at.
  • 42:24OK, thank you. Toma, can you say
  • 42:27the name of the other, not Prism
  • 42:29but the other program?
  • 42:31I have a slide afterwards where I show
  • 42:34it. It's Origin. OK, thanks, yeah.
  • 42:37I have a question. So,
  • 42:38so my initial understanding is that
  • 42:40the whisker length represents the
  • 42:4295th percentile of the data range.
  • 42:45But here it says the whisker
  • 42:47length is 1.5 times the IQR length.
  • 42:50But if that's the case,
  • 42:52why would the left side
  • 42:54whisker and the right side whisker
  • 42:57have different lengths?
  • 43:02Um. So that could be, for example,
  • 43:06because here you have, so that's
  • 43:09the maximal length of the whisker.
  • 43:12But if the minimum of your
  • 43:14data is here,
  • 43:17the whisker stops. So that's why.
  • 43:19So you see, here you have outliers, and
  • 43:22so the whisker can extend to the maximum
  • 43:25point, that is 1.5 times this measure.
  • 43:27But if before the maximal distance
  • 43:30here you meet the minimal point,
  • 43:32the whisker ends there,
  • 43:34so that's why.
  • 43:36OK, I see. It's also true that this
  • 43:39whisker definition can be customized,
  • 43:41so this is the default interpretation.
  • 43:43I don't know who decided this.
  • 43:46I don't have the original publication,
  • 43:48but you can choose whiskers to
  • 43:50be different, so that's why,
  • 43:52also in the Network Journal
  • 43:54paper, when you do a box plot,
  • 43:56you always have to specify in the statistical
  • 43:59methods how you designed your box plot.
  • 44:02So you have to provide how,
  • 44:04for example, the whiskers were defined,
  • 44:07because sometimes it's true that,
  • 44:08for example,
  • 44:09the whisker can represent like
  • 44:1195% of the distribution.
  • 44:13Right, so this is just the default,
  • 44:15but it can be customized,
  • 44:17so there are different choices.
  • 44:21I have a question regarding the
  • 44:24distribution again, maybe it's in
  • 44:26continuation of what you just said.
  • 44:30Some software allows a default value
  • 44:32for the bin size and for the smoothing
  • 44:36and all that, say like MATLAB, that
  • 44:38I've been trying to put this into.
  • 44:41How reliable do you think that is,
  • 44:44the default values, and how would you suggest?
  • 44:48Most of the time,
  • 44:49most of the time, so I don't,
  • 44:52I don't have experience with MATLAB,
  • 44:53but probably it will be
  • 44:56the same as in R, so most of the time
  • 44:58there is a sort of optimization there,
  • 45:01so most of the time it is fine. Uh, but
  • 45:07sometimes, especially if you
  • 45:09have a distribution of data,
  • 45:11but you also have a point
  • 45:14with accumulation of data,
  • 45:16you could have problems in the,
  • 45:20in the plot. So,
  • 45:22but I don't have an example.
  • 45:25OK, so like 95% of the time I'm OK
  • 45:30with the solution that is
  • 45:33provided by the MATLAB or R built-in tool.
  • 45:38For example, sometimes when you compare
  • 45:40two distributions with a different size,
  • 45:42with a different number of points,
  • 45:44that can be a problem.
  • 45:48Because sometimes,
  • 45:50if you're comparing for example a
  • 45:52distribution of 10 points with
  • 45:54a distribution of 1000 points,
  • 45:56adopting the same wavelength
  • 45:58could be a problem,
  • 45:59and so you need to manually change it.
  • 46:03So that's the, yes,
  • 46:05that probably could be
  • 46:08a practical example of when it's not ideal.
  • 46:12Because the software,
  • 46:13if you are trying to compare
  • 46:1610 points versus 1000 points,
  • 46:18tries to define a common wavelength.
  • 46:21But sometimes this leads
  • 46:23to, like, distorted images.
  • 46:25I don't have an example to show.
  • 46:29That's good enough, thank you.
  • 46:32And well, I can leave a note.
  • 46:35Sometimes you can see also the
  • 46:37notch inside your box whisker,
  • 46:39so the notch is this
  • 46:42feature that represents a measure
  • 46:44of uncertainty for the median.
  • 46:47So sometimes it is useful to have
  • 46:49these because, if you are comparing
  • 46:51a lot of box whisker plots, you can
  • 46:54look at the uncertainty as if it was a
  • 46:56sort of standard error of the median.
  • 46:59And so if two box whiskers overlap in
  • 47:02their notches,
  • 47:03probably it means that the
  • 47:05medians are not statistically
  • 47:08significantly different.
  • 47:09This could be a way to
  • 47:12use the notch: it eases
  • 47:15the interpretation of the data
  • 47:16and the comparison of different
  • 47:18distributions, and that's why
  • 47:19box whisker plots are so popular,
  • 47:21because they allow you to represent
  • 47:24a distribution of data in
  • 47:25a very compact format.
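
A hedged sketch of how a notch can be computed and compared. It assumes the commonly used approximation median ± 1.57 × IQR / √n; the exact factor and the quartile interpolation vary between plotting packages, and the three groups are invented for illustration.

```python
import math
import statistics

def notch_interval(data, factor=1.57):
    """Approximate interval around the median: median ± factor * IQR / sqrt(n)."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    half_width = factor * (q3 - q1) / math.sqrt(len(data))
    return median - half_width, median + half_width

def notches_overlap(a, b):
    """True when the notch intervals of two samples share at least one value."""
    lo_a, hi_a = notch_interval(a)
    lo_b, hi_b = notch_interval(b)
    return lo_a <= hi_b and lo_b <= hi_a

group_a = [10, 11, 12, 13, 14, 15, 16, 17, 18]
group_b = [11, 12, 13, 14, 15, 16, 17, 18, 19]
group_c = [30, 31, 32, 33, 34, 35, 36, 37, 38]
print("A vs B overlap:", notches_overlap(group_a, group_b))
print("A vs C overlap:", notches_overlap(group_a, group_c))
```

Overlapping notches are only a visual hint that the medians may not differ; they do not replace a formal statistical test.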
  • 47:27This is another display of the
  • 47:29anatomy of the box whisker,
  • 47:31but it doesn't add anything
  • 47:33to what I said before.
  • 47:35So here is an example where box whisker
  • 47:38plots are used in order to compare
  • 47:42four different distributions.
  • 47:43So the advantage is that they allow
  • 47:46easy comparison, so it's easy to
  • 47:49compare the distributions of A, B, C and D.
  • 47:52The problem they can have is that they
  • 47:55hide the shape of the distribution.
  • 47:58And also usually they hide the
  • 48:01number of points that were used
  • 48:03to build the box whisker.
  • 48:05Sometimes you can encode the number
  • 48:08of points, so the cardinality, the
  • 48:10size of the distribution, as the
  • 48:13width of the box whisker,
  • 48:15but it's rarely used because it's not
  • 48:18very visually beautiful, I would say.
  • 48:21So one solution could be to
  • 48:23overlay on the box whisker
  • 48:26plot the jitter plot.
  • 48:27So the jitter plot represents the single
  • 48:30points that were used to build
  • 48:32the box whisker plot, and they are,
  • 48:34so while on the Y axis there
  • 48:36are the precise values, on the X
  • 48:39axis they are randomly
  • 48:43placed. Let's say there are
  • 48:45also methods that do not display
  • 48:47these points randomly but in a
  • 48:49pseudo-random way that captures
  • 48:50the shape of the distribution,
  • 48:52and I think that that kind of plot
  • 48:55is also present in GraphPad
  • 48:58Prism. So the advantage of this
  • 49:00is that you can see, for example,
  • 49:03that the B distribution is bimodal,
  • 49:06because you see that there are
  • 49:08these high densities of points, and
  • 49:10the box whisker plot cannot capture
  • 49:12that; you cannot see from a box
  • 49:14whisker plot that a distribution is
  • 49:17bimodal. And for example here you
  • 49:19can see that this box whisker
  • 49:21plot is based on
  • 49:24much less data than the others.
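
The jitter idea can be sketched as follows: the Y value of each point is kept exact, and only its X position is shifted by a small random offset around the position of its group. The function name, the jitter width and the example values are assumptions made for this illustration.

```python
import random

def jitter_positions(values, center, width=0.2, seed=0):
    """Pair each value with a random x offset around the category position."""
    rng = random.Random(seed)
    return [(center + rng.uniform(-width, width), y) for y in values]

# Hypothetical bimodal group drawn at x position 2.
group_b = [1.1, 1.3, 1.2, 4.8, 5.1, 4.9, 1.0, 5.0]
points = jitter_positions(group_b, center=2)
for x, y in points:
    print(f"x={x:.3f}  y={y}")
```

Because the Y values are untouched, the two clusters around 1 and 5 remain visible, together with the number of points behind the box.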
  • 49:26So another solution for this
  • 49:29is to enclose the box whisker
  • 49:32plot in a violin plot.
  • 49:35So a violin plot representation
  • 49:37like this allows you to see the
  • 49:40same information as a box whisker,
  • 49:43but also information on the shape
  • 49:46of the distribution. Basically,
  • 49:49in a violin plot
  • 49:50you add a density plot that is
  • 49:54parallel to the vertical axis.
  • 49:58And here, by using a violin plot, you can see
  • 50:00that this,
  • 50:01that this distribution
  • 50:03has one peak. This one is bimodal.
  • 50:08And you can also add the number here,
  • 50:11instead of encoding the number,
  • 50:13the size of the distribution, as
  • 50:15the width of the distribution.
  • 50:19So this is an example
  • 50:21of comparisons between different
  • 50:23ways to show a distribution.
  • 50:25Here you see the histogram with the
  • 50:28corresponding density plot,
  • 50:29the same distribution
  • 50:31visualized as a box plot,
  • 50:33and visualized as a violin plot that
  • 50:36captures both the features of a box
  • 50:39plot plus the density distribution.
  • 50:42And this is for a normal distribution.
  • 50:44This is for a bimodal distribution, where you
  • 50:47can see that the box plot doesn't capture it.
  • 50:50So the box plot can capture the fact
  • 50:53that the data are not symmetrical, and
  • 50:55you see, for example, that the distance from
  • 50:58this point of the box
  • 51:01to the median is much more than this.
  • 51:04So the box whisker is good at capturing
  • 51:07asymmetrical distributions but not
  • 51:09the presence of more than one peak,
  • 51:11so not the complex shape of the distribution.
  • 51:15And there is a website here
  • 51:17where you can see a lot of
  • 51:21examples where a different choice of
  • 51:23visualization can lead to a different
  • 51:26conclusion, as here.
  • 51:29It's true also that the violin plot
  • 51:32is not efficient because
  • 51:35you're showing twice
  • 51:37the same information,
  • 51:39so this is aesthetically pleasant,
  • 51:42but is not efficient because
  • 51:44you're repeating basically this
  • 51:46density twice, above and below,
  • 51:49and so that's why,
  • 51:52for the sake of efficiency,
  • 51:55there are recent visualization
  • 51:57strategies such as the raincloud plot.
  • 52:00So the raincloud plot shows a box
  • 52:03whisker plot in the middle, half a violin
  • 52:06plot here, and then also the single points.
  • 52:09So that's probably one of the
  • 52:12most complete, exhaustive ways to
  • 52:14represent a distribution of data.
  • 52:16And they're called raincloud because
  • 52:18of this effect: this should be the cloud,
  • 52:20and this is the rain that falls
  • 52:22down below.
  • 52:23So you can find information on
  • 52:25how to plot these by
  • 52:28following this link.
  • 52:29Another, yeah.
  • 52:31Quick question: is there a,
  • 52:35how to say, a restriction or limitation
  • 52:38as to how many data points are required
  • 52:43for generating a reliable violin plot?
  • 52:50Generally not. So, probably more than 10,
  • 52:55I would say, because otherwise,
  • 52:58you can see it empirically,
  • 53:00because if the data are too few you can
  • 53:03see that the violin basically has a sort
  • 53:06of wave around each point of your data.
  • 53:10So as a general threshold
  • 53:15I would say 10 points would be
  • 53:19like the minimum number. And I'm asking
  • 53:21that question because of course if
  • 53:23you have a lot of data points,
  • 53:26these would be informative.
  • 53:27But if you have, let's say,
  • 53:29less than 10 or a small number,
  • 53:31this could be really distorting
  • 53:33or faking the. Yeah, yeah, that's
  • 53:35true. That's why I would
  • 53:37say 10, because below 10
  • 53:39probably the best strategy is to show
  • 53:41the single points and then a summary
  • 53:44such as the mean or median plus
  • 53:46some variation, the standard deviation,
  • 53:48but not the distribution
  • 53:50as a violin plot.
  • 53:51So that's for, like, less than 10 data points.
  • 53:57Also, when data are too many,
  • 53:59for example, it doesn't make
  • 54:00sense to show the single points,
  • 54:03because they
  • 54:04overlap each other and so you
  • 54:06don't see anything. That happens
  • 54:08when you have more than 1000 points,
  • 54:10and so the best solution in that case
  • 54:12is for example to show only the violin.
  • 54:18So there is a range for which
  • 54:22the best solution is to show the
  • 54:24single data points with the crossbar,
  • 54:27so an element which captures mean or
  • 54:30median plus standard deviation or
  • 54:33confidence interval; there is a
  • 54:35range that is in the middle, from 10
  • 54:38to some hundreds, where the violin
  • 54:40plot and the box whisker plot are
  • 54:43the best option to visualize.
  • 54:44And when you have many,
  • 54:47many data, more than 1000, probably,
  • 54:48if you want to capture the distribution,
  • 54:51then only the violin plot rather
  • 54:53than the single points is the best way.
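
The rule of thumb from this answer can be summarized as a small helper. The cutoffs (10 and 1000) are the approximate figures mentioned in the talk, not strict limits, and the function name and return strings are hypothetical.

```python
def suggest_distribution_plot(n_points):
    """Rule of thumb from the talk; the cutoffs are approximate, not strict."""
    if n_points < 10:
        # Too few points for a density estimate to be meaningful.
        return "single points with a crossbar (mean or median plus spread)"
    if n_points <= 1000:
        # Enough points to summarize, few enough to still overlay them.
        return "box whisker plot or violin plot"
    # So many points that individual markers would overlap.
    return "violin plot without single points"

for n in (6, 150, 20000):
    print(n, "->", suggest_distribution_plot(n))
```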
  • 55:01Did it answer?
  • 55:06Yeah, that was awesome.
  • 55:08That is a great explanation.
  • 55:11OK, another alternative
  • 55:13way to maximize the efficiency of
  • 55:16violins, that I saw a lot
  • 55:18with single cell data, for example,
  • 55:20is the use of split violin plots,
  • 55:23so you use the violin plot to show a
  • 55:26comparison between two distributions.
  • 55:29So you see here, this plot shows the
  • 55:32representation of the age of females and
  • 55:34males using different social
  • 55:37media: Instagram, Facebook, Twitter.
  • 55:38So it's a way to show, using
  • 55:41half of a violin plot, differences
  • 55:44in the distributions, and this can
  • 55:46be used when you have a contrast
  • 55:49of two conditions or you want to
  • 55:52compare two distributions.
  • 55:53Also in single cell...
  • 55:57About the violin plots,
  • 55:58like in such cases, yeah,
  • 56:00so what determines the height of the peaks?
  • 56:03Or is everything normalized
  • 56:05so that the total area is the same,
  • 56:07or the maximum height is the same?
  • 56:10So most of the time,
  • 56:12so you have choices usually, so you can
  • 56:16choose to have the same maximum height.
  • 56:20And that's
  • 56:22usually what you find,
  • 56:24so you plot in a way
  • 56:27that the range is the same, from
  • 56:30here to here, from here to here.
  • 56:33The alternative is to use the
  • 56:35real criterion for a density,
  • 56:38and that should be that the
  • 56:41area under this is equal to 1,
  • 56:45so that the two have the same area.
  • 56:48An alternative is to have an
  • 56:50area that is proportional to
  • 56:52the number of observations,
  • 56:54but I think that visually most
  • 56:57of the time you find that
  • 57:00the criterion, in order
  • 57:02to have balanced plots,
  • 57:05is to have the same range,
  • 57:07meaning from here to the maximum
  • 57:09for all the plots,
  • 57:11independently from the area and
  • 57:13independently from the number of points.
  • 57:18It's probably not the best solution from
  • 57:20the point of view of communication,
  • 57:21but it's the most used. OK, thank you.
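
The three normalization choices mentioned in this answer can be sketched on a density curve sampled at evenly spaced positions. The mode names "width", "area" and "count" are labels chosen for this illustration, and the two sampled curves are hypothetical.

```python
def scale_violin(density, mode, n_points=None, step=1.0):
    """Rescale a sampled density curve before drawing it as a violin."""
    if mode == "width":        # same maximum extent for every violin
        factor = 1 / max(density)
    elif mode == "area":       # area under the curve equals 1
        factor = 1 / (sum(density) * step)
    elif mode == "count":      # area proportional to the number of observations
        factor = n_points / (sum(density) * step)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [d * factor for d in density]

narrow = [0, 1, 6, 1, 0]   # hypothetical sharply peaked distribution
broad = [1, 2, 2, 2, 1]    # hypothetical flat distribution

for mode in ("width", "area"):
    print(mode, max(scale_violin(narrow, mode)), max(scale_violin(broad, mode)))
print("count", sum(scale_violin(broad, "count", n_points=16)))
```

With the "width" choice both violins reach the same maximum, which looks balanced but hides the fact that, at equal area, the sharply peaked distribution is three times taller than the flat one.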
  • 57:27A variation of this is also the
  • 57:30use of ridgeline plots.
  • 57:32They allow you to compare a
  • 57:35lot of different densities.
  • 57:37For example, here you see a comparison
  • 57:40of the density of temperatures in
  • 57:43different months in a location that,
  • 57:45if I remember, is Lincoln, NE, and this
  • 57:48is used in single cell,
  • 57:52is used a lot now, in these years,
  • 57:54with single cell data.
  • 57:55For example,
  • 57:56here you see that it is used to
  • 57:58compare the distribution of the
  • 58:01expression of 1 gene leads A
  • 58:03or CL5 in different populations
  • 58:05of cells that probably
  • 58:07come from some blood sample.
  • 58:10Different populations, and
  • 58:11this allows you to see,
  • 58:13for example, marker genes, or to
  • 58:16see how the expression of a gene is
  • 58:19specific for a population of cells.
  • 58:22So that's why I included it, because I
  • 58:25see that the frequency of this plot,
  • 58:28especially in the single cell
  • 58:31visualization field, is increasing quite a lot.
  • 58:34I have this section of the
  • 58:36presentation that we could skip.
  • 58:38In general the message about this
  • 58:40is that Venn diagrams are good
  • 58:43when you have two sets,
  • 58:45but if there are,
  • 58:46if there are more, they are a bad way to
  • 58:49represent intersections between sets.
  • 58:51And this actually is a plot that
  • 58:54was published in Nature, and it's
  • 58:56about a comparison of the genome
  • 58:59of the banana with other species.
  • 59:01So the problem in general is that
  • 59:04when you have more than two, three,
  • 59:06four, but also two sets,
  • 59:09it's not the best way to
  • 59:11visualize intersections with the use
  • 59:13of the traditional Venn diagrams.
  • 59:15So a table is probably more effective
  • 59:18than this, because the areas are
  • 59:21not proportional to the size,
  • 59:23and it's quite confusing to see
  • 59:26the specific intersections. And
  • 59:28so an alternative way
  • 59:30that was developed in recent years
  • 59:32was the use of the concept of
  • 59:35UpSet plots, so to represent the
  • 59:37intersections in a matrix format.
  • 59:40So you represent these as membership
  • 59:42of some object,
  • 59:44for example
  • 59:44a gene, that is present in only
  • 59:46list A, only list D, only list C,
  • 59:49the intersection between A and D, or a gene
  • 59:51present in all the intersections.
  • 59:53So you can use this matrix format
  • 59:55to show the intersections, and then
  • 59:58you can display the cardinality
  • 01:00:00of each
  • 01:00:01intersection, so the number of genes,
  • 01:00:03for example, that are only in the PDGFR
  • 01:00:06pathway, that you see here,
  • 01:00:08the number of genes that are in
  • 01:00:11common between the EGFR and PTEN
  • 01:00:14pathways, that you see here.
  • 01:00:15So this is a way to show the cardinality
  • 01:00:19of the global lists, that you see here.
  • 01:00:22And also you can rank the intersections
  • 01:00:24between the different sets according
  • 01:00:26to their size, to their cardinality.
  • 01:00:29So it's much clearer
  • 01:00:30in showing the structure of the intersections
  • 01:00:34than using a Venn diagram.
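
The numbers behind such a matrix plot can be sketched in a few lines: each item is assigned to the exact combination of sets it belongs to, and the combinations are then ranked by size. The gene names and list contents are invented for illustration, and `exclusive_intersections` is a hypothetical helper name.

```python
from collections import Counter

def exclusive_intersections(sets):
    """Count items by the exact combination of sets they belong to."""
    membership = Counter()
    for item in set().union(*sets.values()):
        combo = tuple(sorted(name for name, s in sets.items() if item in s))
        membership[combo] += 1
    # Largest intersections first, ties broken by the combination name.
    return sorted(membership.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical gene lists.
gene_lists = {
    "A": {"g1", "g2", "g3", "g4", "g5"},
    "B": {"g4", "g5", "g6"},
    "C": {"g5", "g7"},
}
ranked = exclusive_intersections(gene_lists)
for combo, size in ranked:
    print(" & ".join(combo), size)
```

Each row of the output corresponds to one column of the matrix (which lists are filled in) and one bar (the cardinality of that intersection).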
  • 01:00:38I skip these because
  • 01:00:41they were some examples of bad
  • 01:00:44usage of graphics in politics.
  • 01:00:47A lot, looking online,
  • 01:00:50are related to Fox News,
  • 01:00:52to bad usage of data display.
  • 01:00:54So the final part could be how to
  • 01:00:56draw these plots. There
  • 01:00:58they were trying to
  • 01:01:00not make the point? Yeah,
  • 01:01:02well, they were trying
  • 01:01:04to give a message by distorting the.
  • 01:01:09Yeah, this for example is an
  • 01:01:11issue: whether you always need to
  • 01:01:13include the zero in your plot.
  • 01:01:15So this is controversial.
  • 01:01:17Let's say that, in general,
  • 01:01:19in bar plots leaving it out is a bad idea,
  • 01:01:21but for example it is a good
  • 01:01:23idea in time series,
  • 01:01:25and that's because in bar plots
  • 01:01:27the height of the bar is
  • 01:01:30the main message of the figure,
  • 01:01:32while for example here in a
  • 01:01:34time series the main message
  • 01:01:36is how the two trajectories
  • 01:01:38evolve and are interconnected.
  • 01:01:40So the main issue is the horizontal
  • 01:01:43axis and so you can skip the zero.
  • 01:01:47So again,
  • 01:01:47it depends on how much the inclusion
  • 01:01:50or exclusion of the zero distorts
  • 01:01:53the main message of the figure.
  • 01:01:56So how to draw plots?
  • 01:01:58So here are there is an outline
  • 01:02:00of the software that you have,
  • 01:02:03so this is some commercial
  • 01:02:05software from the most from Excel.
  • 01:02:07It's probably the most used or available,
  • 01:02:10but it doesn't allow to plot all the
  • 01:02:13solutions that they did show before, but.
  • 01:02:16For example,
  • 01:02:17Grandpa,
  • 01:02:18Graphpad prism,
  • 01:02:18or Origin Pro are through software
  • 01:02:21that are available and with those
  • 01:02:23that you should be able in an
  • 01:02:26environment that is similar to Excel
  • 01:02:28to produce most of the plots that
  • 01:02:30you saw in the presentation today.
  • 01:02:33So this is commercial software,
  • 01:02:35doesn't require programming
  • 01:02:36skill on these sides.
  • 01:02:38Are you see the main solutions
  • 01:02:40that are used by data scientists,
  • 01:02:42but that require programming
  • 01:02:44skills so that the two most
  • 01:02:46common languages in data science,
  • 01:02:48RR and Python so far are you have is GG plot.
  • 01:02:53Library for Python.
  • 01:02:54You have matplotlib or Seaborn.
  • 01:02:58At these require programming so,
  • 01:02:59but I would say that the advantage
  • 01:03:02nowadays of using visa is that you can
  • 01:03:05find a lot of really a lot of examples.
  • 01:03:09Because there are a lot of website
  • 01:03:12that where you can choose that
  • 01:03:15you're like data visualization
  • 01:03:17type and you see already the code.
  • 01:03:20That you can use in order
  • 01:03:22to produce the blocked.
  • 01:03:23So I would say that you just need
  • 01:03:26to know how to insert or how to
  • 01:03:28load your table of data into the
  • 01:03:31programming environment.
  • 01:03:33And then most of the difficulties
  • 01:03:35are probably in fixing details,
  • 01:03:37so it's very easy to produce
  • 01:03:40different plots;
  • 01:03:41it's more complicated to adapt the
  • 01:03:44small things to your taste.
  • 01:03:48But so
  • 01:03:48the suggestion is that if you
  • 01:03:50do a lot of visualization,
  • 01:03:52it's worth investing in this.
  • 01:03:57Here you see maybe a future perspective,
  • 01:04:01which could be online solutions.
  • 01:04:04Some are already available, so,
  • 01:04:06for example, you can produce UpSet plots
  • 01:04:09or raincloud plots,
  • 01:04:10or some other exotic type
  • 01:04:13of data visualization, online.
  • 01:04:15So there are websites,
  • 01:04:17web servers where you can insert
  • 01:04:20your data as tables and they produce
  • 01:04:23the plot that you want, and you have
  • 01:04:27some sort of interactivity.
  • 01:04:28So that could be the future,
  • 01:04:32where web servers provide you with
  • 01:04:34the main programming environment:
  • 01:04:36you just need to insert your data
  • 01:04:38and you can see interactively,
  • 01:04:44by interacting with the web server,
  • 01:04:46how to customize the plot.
  • 01:04:48Most of the solutions right now are
  • 01:04:53commercial, and so you need to pay,
  • 01:04:56and that's the drawback of this.
  • 01:04:58But it could probably be the
  • 01:05:00future of matching programming
  • 01:05:02with ease of use.
  • 01:05:06This is a useful resource that
  • 01:05:08you can also use to decide which
  • 01:05:11kind of plot you want.
  • 01:05:13So there are a lot of these trees:
  • 01:05:16depending on what you want to represent,
  • 01:05:19one numeric variable, two numeric
  • 01:05:21variables, or categorical variables,
  • 01:05:22you can follow the tree and arrive
  • 01:05:24at the best graphical
  • 01:05:26solution to display your data.
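The idea of following a tree from variable types to a plot type can be sketched as a small lookup. The mapping below is a simplified, hypothetical one built from common conventions; it is not the tree from the resource shown in the talk, and the names are assumptions of this sketch.

```python
# Simplified, hypothetical mapping from variable types to a common plot choice.
SUGGESTIONS = {
    ("numeric",): "histogram or density plot",
    ("categorical",): "bar plot",
    ("numeric", "numeric"): "scatter plot",
    ("categorical", "numeric"): "box plot or violin plot",
    ("categorical", "categorical"): "grouped bar plot or heatmap of counts",
}


def suggest_plot(*variable_types):
    """Return a plot suggestion for the given variable types, in any order."""
    key = tuple(sorted(variable_types))
    return SUGGESTIONS.get(key, "no suggestion in this small table")


print(suggest_plot("numeric"))
print(suggest_plot("numeric", "categorical"))
```

A real decision tree would also branch on things like the number of observations or whether the data are ordered, which this small table ignores.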
  • 01:05:28So I suggest you visit it,
  • 01:05:30also to look at what
  • 01:05:33kinds of possibilities for data
  • 01:05:35representation you have. Online
  • 01:05:37there are many of these sites
  • 01:05:39now, and that's why it's easy to
  • 01:05:42look at the documentation and also
  • 01:05:44to retrieve and reproduce the code.
  • 01:05:46This is another example. I close with this
  • 01:05:52quotation, which I find particularly
  • 01:05:54related to data visualization:
  • 01:05:57science is not nature itself, but
  • 01:06:00nature under our observation.
  • 01:06:03And so the science of data visualization
  • 01:06:05is a way to allow more adherence between
  • 01:06:10observation, science, and visualization.
  • 01:06:17Thank you.