Best Measurement
Good Practice in Data Analysis
Biology heavily depends on statistics as a tool to attach significance to findings. As a result, data analysis is an essential skill for natural scientists. Nevertheless, it is sometimes regarded as a chore and performed mindlessly, following conventions. But because data analysis is the foundation on which we build our conclusions, it is important to take the time for it and to think about the effect every step has.
This is why we in the dry lab have focused solely on the task of data analysis. In order to ensure that we draw valid conclusions and communicate them properly, we put great care into every step of the process. Perhaps it comes from our initial inexperience, but we questioned every step we took and consulted our supervisors whenever we were uncertain. We took many measures to ensure that our data analysis was sound, so that our team could contribute to the world with confidence.
Note
Because the Biology major at our university includes a mandatory data analysis course in which we used RStudio extensively, we also used it as our coding environment (RStudio 1.4.1106, R 4.0.4) for iGEM. The basis of our knowledge was also gained in that course.
Reduction of Variability
For our project, we often performed ROS assays, which are plant immunity assays that measure the reactive oxygen species burst in response to an elicitor. As we were doing our ROS analysis for the first time, we noticed that the responses to the treatments differed greatly among repetitions (Fig. 1 & 2). Worried that we had done something wrong or that the experiment had not worked, we asked the wet lab whether that was normal.
Fig. 1: Graphs showing ROS burst of each repetition and treatment (Col-0)
Fig. 2: Graphs showing ROS burst of each repetition and treatment (efr1)
They assured us that ROS assays are intrinsically variable, which is why they performed 12 repetitions per sample in each experiment. This is a practice that we placed a lot of value on throughout our experiments, since biological systems are inherently variable. Repetitions are vital to reduce noise and allow us to draw more reliable conclusions. ROS assays are an extreme case in terms of variability, so using even more repetitions would be beneficial, since it would give us more confidence in our findings.
Why does increasing the number of repetitions increase confidence?
The reason is that every measurement is only a snapshot of a state the system happens to be in. It is difficult to tell whether this specific state is in the normal range or not. Perhaps one just measured a rare condition! By increasing the number of repetitions, one can record the normal range of values, and the standard error of the mean shrinks with the square root of the number of repetitions. Essentially, the measurements are put into the context of the system one is analysing.
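As a small illustration of this effect, the following R sketch simulates repetitions of a measurement with made-up values for the true mean and spread (all numbers and names here are purely hypothetical, not taken from our experiments) and shows how the standard error of the mean decreases as more repetitions are taken.

```r
# Purely illustrative simulation: 'true_mean' and 'true_sd' are made-up values,
# not parameters from any of our experiments.
set.seed(42)

true_mean <- 100   # hypothetical true value of the measured quantity
true_sd   <- 20    # hypothetical biological + technical variability

standard_error <- function(n) {
  # draw n repetitions and return the standard error of their mean
  x <- rnorm(n, mean = true_mean, sd = true_sd)
  sd(x) / sqrt(length(x))
}

n_reps <- c(3, 6, 12, 24, 48)
round(sapply(n_reps, standard_error), 2)
# the standard error shrinks roughly with 1 / sqrt(n):
# quadrupling the number of repetitions about halves it
```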
Data Wrangling
During our analyses, we often encountered statistical outliers which stemmed from contamination of the samples. The wet lab always clearly marked such samples, which made the assessment of whether to keep or discard them much easier. We often created boxplots and histograms to see whether there were any strong outliers (Fig. 3). When we stumbled on measurements that stood out, we could check the original dataset to see whether the particular measurement was contaminated. When we were unsure whether we should keep a measurement because the deviation was not large or the contamination was not strong, we kept it in the data set. It is important not to alter the collected data whenever possible, as this could bias the results. Handling outliers is a delicate issue, and we also consulted our supervisors when we were in doubt, as they are much more experienced than we are.
Fig. 3: Example histogram from SGI where a Mock measurement was rotten.
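A minimal sketch of how such a first look at the data could be done in R. The data frame 'sgi' and its columns 'treatment', 'weight' and 'contaminated' are assumed names for illustration, not the actual structure of our data sets.

```r
# Hypothetical data frame 'sgi' with a grouping factor 'treatment', a numeric
# response 'weight' and a logical flag 'contaminated' set by the wet lab.
hist(sgi$weight,
     breaks = 20,
     main   = "Distribution of measurements",
     xlab   = "Weight")

boxplot(weight ~ treatment, data = sgi,
        ylab = "Weight",
        main = "Per-treatment distributions")

# cross-check the points that stand out against the samples
# that the wet lab flagged as contaminated
sgi[sgi$contaminated, ]
```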
Because we were not experienced with data analysis in the beginning, we often worked on separate computers and performed the analyses in parallel. This served as a safety net and ensured that differences between operating systems had no effect on the execution of the code. When we reached different results, we discussed our approaches to find the more adequate one for a particular analysis. This system allowed us to double-check our steps and catch our mistakes, as well as to optimise our approach.
Balanced designs are a prerequisite for an ANOVA with an interaction. However, we were sometimes confronted with an unbalanced design. In such cases, we carefully considered what our research question was and removed categories from the data set so that the design was balanced, while still making sure that the relevant information could be gained.
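A hedged sketch of how such a check and trimming could look in R; the data frame 'dat' and the columns 'genotype', 'treatment' and 'response' are assumptions made for this example.

```r
# Count observations per combination of factors: equal counts in every cell
# mean the design is balanced (column names are placeholders).
table(dat$genotype, dat$treatment)

# If, say, one genotype is missing a treatment level, drop that category
# before fitting a model with an interaction term.
dat_balanced <- droplevels(subset(dat, genotype != "incomplete_line"))
table(dat_balanced$genotype, dat_balanced$treatment)

fit <- aov(response ~ genotype * treatment, data = dat_balanced)
summary(fit)
```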
The data taken out can sadly not be used, unless it can be evaluated completely separately from the other measurements. We also cannot simply reuse the remaining measurements for additional analyses, because doing so implies several assumptions that amount to a theoretical duplication, or increase, of data points that we never actually collected. Though we are unsure of the exact principles behind this, we believe that it is a complex combination of the inherent variability of nature, stochastics, and the nature of data sets.
Why can we not perform multiple tests on one dataset?
Focusing on single data points
This is a question that we discussed time and time again, but never really found a definitive answer to. We were told that performing multiple tests on a dataset is tantamount to creating data you did not collect. This implies that every measurement can only be 'attributed' to one statistical test (and thus one hypothesis). One reason could be that quantities like the weights we were working with are often continuous and roughly normally distributed. In purely mathematical terms, this means that the probability of measuring one exact weight is 0, since the scale is continuous. What we arrived at here is that every data point is unique and can therefore only be measured and used once. Another aspect is that a measurement is only a snapshot in time. Every point is the product of hundreds of factors that just so happened to produce that exact value. Using such a measurement to explain multiple things implies the assumption that it is independent from everything else and exists without changing, which of course is not the case. Additionally, the other parameters that influence the range of values a measurement can take are often actively manipulated by the person conducting the experiment, because experiments are set up in such a way that the data can answer a certain, predetermined question. This means that the conditions under which a data point is collected would not be the same if the question were different. In effect, there is already a certain bias in the measurements which would be out of place if they were used to answer a different question.
Why can we not perform multiple tests on one dataset?
Focusing on whole data sets
One could also look at this from the data set point of view. A measurement never comes alone - it is placed into a certain context by the other data points that are collected during the same experiment. The combination of all these measurements is what allows us to either accept or reject the null hypothesis. Perhaps it is exactly this context that makes it impossible to draw other conclusions from the exact same data set.
Controls
Controls are an essential part of the experimental design. We always had a positive and a negative control where it made sense. Though we included the positive controls in the plots to show the viewer that nothing was faulty in our setup, we decided to remove them from our statistical tests. The reason for this is that we used an ANOVA for most of our analyses. This means that the means of the treatment groups are compared to each other, with the negative control as the reference category, and the result indicates whether at least one of the means differs. The problem with the positive control in such a setup is that it is intrinsically different from the negative control and would therefore strongly influence the outcome of the test. In effect, this would render the ANOVA uninformative and we would only have the post-hoc test (TukeyHSD) left for the evaluation of the data. Post-hoc tests, however, can only be used to formulate new hypotheses, so we would not be able to draw conclusions from our experiments. As scientists, this would have been an impossible sacrifice to make, so we removed the positive controls from the statistical tests.
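A minimal sketch of this step in R, assuming a data frame 'ros' with a factor 'treatment' (containing levels such as "neg_ctrl" and "pos_ctrl") and a numeric response 'total_fluorescence'; all of these names are placeholders.

```r
# Drop the positive control before testing; it stays in the plots only.
ros_test <- droplevels(subset(ros, treatment != "pos_ctrl"))

# Make the negative control the reference category of the ANOVA.
ros_test$treatment <- relevel(ros_test$treatment, ref = "neg_ctrl")

fit <- aov(total_fluorescence ~ treatment, data = ros_test)
summary(fit)
```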
Fitting Models
Fitting linear models comes with assumptions that need to be met:
- The expected value of the residuals is 0
- All residuals have the same variance
- All residuals are normally distributed
- The residuals are independent
To ensure that the models we fitted were appropriate for the data, we always checked the diagnostic plots to see whether all the modelling assumptions were met (Figures 4-7). Because the assumption of normality was not met in the ROS assays, we transformed the response variable of our model (total fluorescence) using the natural logarithm. However, for legibility, we used the untransformed data in the plots.
Fig. 4: Tukey-Anscombe plot
Fig. 5: QQ-plot
Fig. 6: Scale-location plot
Fig. 7: Leverage plot
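As a hedged sketch of this workflow in R (the data frame 'ros' and its columns are again placeholder names), the four diagnostic plots shown above can be produced directly from a fitted linear model, and the fit can be repeated on the log-transformed response if the assumptions are violated.

```r
# Fit the model on the raw response and inspect the diagnostic plots
# (Tukey-Anscombe, QQ, scale-location and leverage plot).
fit_raw <- lm(total_fluorescence ~ treatment, data = ros)
par(mfrow = c(2, 2))
plot(fit_raw)

# If, for example, the QQ-plot deviates clearly from the diagonal,
# refit the model on the log-transformed response.
fit_log <- lm(log(total_fluorescence) ~ treatment, data = ros)
plot(fit_log)
par(mfrow = c(1, 1))
```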
What do plots 4 and 5 tell us?
Fig. 4: Tukey-Anscombe plot
When one fits a linear model, one can imagine a line running through the measurements. This line should describe the trend as well as possible. The Tukey-Anscombe plot shows how big the differences between your measurements and this line (the residuals) are. We are looking for a regular scatter of points without any pattern. The plot can show whether assumptions 1, 2 and 4 are met.
Fig. 5: QQ-plot
The QQ-plot compares the quantiles of a normal distribution to the quantiles of the distribution of the residuals, i.e. the differences between your measurements and the fitted line described above. It checks whether assumption 3 holds. If your QQ-plot does not lie close to a diagonal line, the residuals are not normally distributed and a transformation might be necessary.
What do plots 6 and 7 tell us?
Fig. 6: Scale-location plot
If you see a trend here, it shows that the variance of the residuals changes with the fitted values. If you remember assumption 2, the variances should not change, so no trend would be ideal.
Fig. 7: Leverage plot
The leverage plot shows how much 'pull' a point has, i.e. how strongly it can influence the model. The leverage of a point is expected to be high when it lies far away from the mean of the explanatory variables. This plot (as well as the others) can be used to identify potential outliers.
Drawing Conclusions
We made sure to not only focus on and report p-values, but also F-statistics and R2 values, since they give valuable information. While R2 values show how well our explanatory variables explain the variability in the response variable, F-statistics show how large the variability between groups is compared to the variability within groups. A low F-statistic means that the variability between groups is not much larger than the variability within groups, which implies that the categories we tested are not that different from each other. Since we mainly performed ANOVAs, both of these values tell us how good our explanatory variables are at explaining the way the response variable fluctuates.
When the result of an ANOVA is significant, we cannot say which group is significantly different. This is why people often perform post-hoc tests. As mentioned above, post-hoc tests do not allow us to make statistical conclusions either, but they can be used to formulate new hypotheses. They can thus be an indication of the more detailed 'mechanisms' behind the significance of the ANOVA, but the results are to be handled with care. One of our advisors uses Dunnett's tests in the Prism software, but we found that Dunnett's tests compare all means to the reference category, which does not fit our purpose because we want to compare all groups to one another. We talked to Prof. Owen Petchey, the lecturer of our data analysis course, who suggested we use the TukeyHSD. It is a test that compares all means to one another and uses a family-wise error rate to correct for multiple testing. This means that our confidence level across all pairwise tests combined is set at 95%. Because this test ticks all our boxes, we decided to use it for our analyses.
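A minimal sketch of how these quantities and the post-hoc test can be obtained in R (data frame and column names are placeholders carried over from the examples above).

```r
# F-statistic, its p-value and R squared are reported by summary() of a
# fitted linear model.
fit_lm <- lm(log(total_fluorescence) ~ treatment, data = ros_test)
summary(fit_lm)

# The same model as an ANOVA, followed by Tukey's honest significant
# differences with a 95% family-wise confidence level.
fit_aov <- aov(log(total_fluorescence) ~ treatment, data = ros_test)
summary(fit_aov)
TukeyHSD(fit_aov, conf.level = 0.95)
```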
What is the multiple testing problem?
The multiple testing problem arises when one performs many pairwise tests, as in the TukeyHSD. Every time one tests a pair, there is a probability of 5% (a) of rejecting the null hypothesis even though it is true (Type I error). If one then performs n tests:
P(at least one false significant result) = 1 - P(no false significant result)
P(at least one false significant result) = 1 - (1 - a)^n
One can see that the probability of making at least one Type I error rises quickly as n increases, meaning that the confidence in the results shrinks. This is often corrected for by setting a smaller 'a' for every individual test, the larger n is.
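A short illustration in R of how quickly the family-wise error rate grows with the number of tests, and of a simple per-test correction of the kind described above (a generic illustration, not part of our analysis code).

```r
a <- 0.05       # per-test significance level
n <- 1:20       # number of pairwise tests

fwer <- 1 - (1 - a)^n          # probability of at least one false positive
round(fwer, 3)                 # already around 0.40 after 10 tests

a_bonferroni <- a / n          # smaller per-test level keeps the
round(a_bonferroni, 4)         # family-wise error rate at or below 0.05
```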
Communication of Results
It is easy to get lost in the heaps of numbers created during data analysis. It was important to us that we could always close the loop back to biological significance: looking at the p-values and R2 values, we asked ourselves what the result means for us and whether it makes sense at all. For the plots, we focused on adding as much information as possible while maintaining legibility. The aim was to give the viewer all the information needed to understand what our data show. For the error bars, we decided to use standard errors, since they provide the most easily interpretable representation of the uncertainty.

In some experiments, the setup was so big that we had to use several plates, which also meant that we had several controls. Standardising the data from different plates was not strictly necessary to ensure comparability, because all the conditions for counting the measurements as one experiment were met (plants of the same age, same time of treatment and measurement, etc.). Nevertheless, we did it in some cases to simplify comparison by eye. To do so, we standardised every plate by its respective negative control, which yields relative values.
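A hedged sketch of such a per-plate standardisation in R, assuming columns 'plate', 'treatment' and 'total_fluorescence' and assuming that standardisation means dividing by the mean of the plate's negative control (the exact procedure may differ between experiments).

```r
# Mean of the negative control per plate (column names are placeholders).
neg_means <- aggregate(total_fluorescence ~ plate,
                       data = subset(ros, treatment == "neg_ctrl"),
                       FUN  = mean)
names(neg_means)[2] <- "neg_mean"

# Express every measurement relative to its plate's negative control.
ros_std <- merge(ros, neg_means, by = "plate")
ros_std$relative_fluorescence <- ros_std$total_fluorescence / ros_std$neg_mean
```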
Facilitating Data Analysis
While we worked on our project, we came to realise how time-consuming and difficult a data analysis can be. Our Wet Lab Team performed the same assays repeatedly, and we believe that being consistent in the evaluation of the same type of experiment is of great importance for reporting congruous results. Therefore, we wrote code containing many functions that make it more efficient to evaluate a dataset with a specific structure. We are sure that we are not the only ones who are not well versed in data analysis at the start of iGEM, because universities often tend to focus on practical work.
When we were looking for ways to analyse Reactive Oxygen Species (ROS) burst assays, we could not find a lot of information. That is why we have made our code for analysing ROS data in RStudio available as a text file for everyone, together with a file in which every function is documented. For easier use, we also provide an example dataset (and an annotated version) with the correct structure for the code (you can download everything here). You can find our short and simple protocol for a ROS assay on our protocols page. This way, future iGEM teams that work on plant immunity can perform their ROS assays more easily and hopefully benefit from the work we did!
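To give an idea of what such a helper could look like (this is a hypothetical sketch, not the published functions themselves), the recurring steps for one assay type can be wrapped into a single reusable R function that expects the documented dataset structure.

```r
# Hypothetical wrapper, not our actual published code: the column names
# ('treatment', 'total_fluorescence') are assumptions for illustration.
analyse_ros <- function(file) {
  dat <- read.csv(file)
  dat$treatment <- factor(dat$treatment)

  # quick visual overview of the groups
  boxplot(total_fluorescence ~ treatment, data = dat,
          ylab = "Total fluorescence")

  # log-transformed ANOVA followed by the TukeyHSD post-hoc test
  fit <- aov(log(total_fluorescence) ~ treatment, data = dat)
  list(anova = summary(fit), tukey = TukeyHSD(fit))
}

# example usage (file name is a placeholder):
# results <- analyse_ros("ros_assay_example.csv")
```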