Best Measurement
Good Practice in Data Analysis
Biology heavily depends on statistics as a tool to attach significance to findings. As a result, data analysis is an essential skill for natural scientists. Nevertheless, it is sometimes regarded as a chore and performed mindlessly, following conventions. But because data analysis is the foundation on which we build our conclusions, it is important to take the time for it and to think about the effect every step has.
This is why we in the dry lab have focused solely on the task of data analysis. In order to ensure that we draw valid conclusions and communicate them properly, we put great care into every step of the process. Perhaps it comes from our initial inexperience, but we questioned every step we took and consulted our supervisors whenever we were uncertain. We took many measures to ensure that our data analysis was sound, so that our team could contribute to the world with confidence.
Note
Because the Biology major at our university includes a mandatory data analysis course in which we used RStudio extensively, we also used it as our coding environment (RStudio 1.4.1106, R 4.0.4) for iGEM. The basis of our knowledge was also gained in that course.
Reduction of Variability
For our project, we often performed ROS assays, which are plant immunity assays that measure the reactive oxygen species burst in response to an elicitor. As we were doing our ROS analysis for the first time, we noticed that the responses to the treatments differed greatly among repetitions (Fig. 1 & 2). Worried that we had done something wrong or that the experiment had not worked, we asked the wet lab whether that was normal.
Fig. 1: Graphs showing ROS burst of each repetition and treatment (Col-0)
Fig. 2: Graphs showing ROS burst of each repetition and treatment (efr1)
They assured us that ROS assays are intrinsically variable, which is why they performed 12 repetitions per sample in each experiment. This is a practice that we placed a lot of value on throughout our experiments, since biological systems are inherently variable. Repetitions are vital to reduce noise and allow us to draw more reliable conclusions. ROS assays are an extreme case in terms of variability, so using even more repetitions would be beneficial, since it would give us more confidence in our findings.
Why does increasing the number of repetitions increase confidence?
The reason is that every measurement is only a snapshot of a state the system happens to be in. It is difficult to tell whether this specific state is in the normal range or not. Perhaps one just measured a rare condition! By increasing the number of repetitions, one can record the normal range of values, and the standard error of the mean shrinks with the square root of the number of repetitions. Essentially, the measurements are put into the context of the system one is analysing.
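As a small illustration of this effect, the following R sketch simulates repetitions of a measurement with made-up values for the true mean and spread (all numbers and names here are purely hypothetical, not taken from our experiments) and shows how the standard error of the mean decreases as more repetitions are taken.

```r
# Purely illustrative simulation: 'true_mean' and 'true_sd' are made-up values,
# not parameters from any of our experiments.
set.seed(42)

true_mean <- 100   # hypothetical true value of the measured quantity
true_sd   <- 20    # hypothetical biological + technical variability

standard_error <- function(n) {
  # draw n repetitions and return the standard error of their mean
  x <- rnorm(n, mean = true_mean, sd = true_sd)
  sd(x) / sqrt(length(x))
}

n_reps <- c(3, 6, 12, 24, 48)
round(sapply(n_reps, standard_error), 2)
# the standard error shrinks roughly with 1 / sqrt(n):
# quadrupling the number of repetitions about halves it
```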
Data Wrangling
During our analyses, we often encountered statistical outliers which stemmed from contamination of the samples. The wet lab always clearly marked such samples, which made the assessment of whether to keep or discard them much easier. We often created boxplots and histograms to see whether there were any strong outliers (Fig. 3). When we stumbled on measurements that stood out, we could check the original dataset to see whether the particular measurement was contaminated. When we were unsure whether we should keep a measurement because the deviation was not large or the contamination was not strong, we kept it in the data set. It is important not to alter the collected data whenever possible, as this could bias the results. Handling outliers is a delicate issue, and we also consulted our supervisors when we were in doubt, as they are much more experienced than we are.
Fig. 3: Example histogram from SGI where a Mock measurement was rotten.
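A minimal sketch of how such a first look at the data could be done in R. The data frame 'sgi' and its columns 'treatment', 'weight' and 'contaminated' are assumed names for illustration, not the actual structure of our data sets.

```r
# Hypothetical data frame 'sgi' with a grouping factor 'treatment', a numeric
# response 'weight' and a logical flag 'contaminated' set by the wet lab.
hist(sgi$weight,
     breaks = 20,
     main   = "Distribution of measurements",
     xlab   = "Weight")

boxplot(weight ~ treatment, data = sgi,
        ylab = "Weight",
        main = "Per-treatment distributions")

# cross-check the points that stand out against the samples
# that the wet lab flagged as contaminated
sgi[sgi$contaminated, ]
```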
Because we were not experienced with data analysis in the beginning, we often worked on separate computers and performed the analyses in parallel. This served as a safety net and ensured that differences between operating systems had no effect on the execution of the code. When we reached different results, we discussed our approaches to find the more adequate one for a particular analysis. This system allowed us to double-check our steps and catch our mistakes, as well as to optimise our approach.
Balanced designs are a prerequisite for an ANOVA with an interaction. However, we were sometimes confronted with an unbalanced design. In such cases, we carefully considered what our research question was and removed categories from the data set so that the design was balanced, while still making sure that the relevant information could be gained.
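A hedged sketch of how such a check and trimming could look in R; the data frame 'dat' and the columns 'genotype', 'treatment' and 'response' are assumptions made for this example.

```r
# Count observations per combination of factors: equal counts in every cell
# mean the design is balanced (column names are placeholders).
table(dat$genotype, dat$treatment)

# If, say, one genotype is missing a treatment level, drop that category
# before fitting a model with an interaction term.
dat_balanced <- droplevels(subset(dat, genotype != "incomplete_line"))
table(dat_balanced$genotype, dat_balanced$treatment)

fit <- aov(response ~ genotype * treatment, data = dat_balanced)
summary(fit)
```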
The data taken out can sadly not be used, unless it can be evaluated completely separately from the other measurements. We also cannot simply reuse the remaining measurements for additional analyses, because doing so implies several assumptions that amount to a theoretical duplication, or increase, of data points that we never actually collected. Though we are unsure of the exact principles behind this, we believe that it is a complex combination of the inherent variability of nature, stochastics, and the nature of data sets.
Why can we not perform multiple tests on one dataset?
Focusing on single data points
This is a question that we discussed time and time again, but never really found a definitive answer to. We were told that performing multiple tests on a dataset is tantamount to creating data you did not collect. This implies that every measurement can only be 'attributed' to one statistical test (and thus one hypothesis). One reason could be that quantities like the weights we were working with are often continuous and roughly normally distributed. In purely mathematical terms, this means that the probability of measuring one exact weight is 0, since the scale is continuous. What we arrived at here is that every data point is unique and can therefore only be measured and used once. Another aspect is that a measurement is only a snapshot in time. Every point is the product of hundreds of factors that just so happened to produce that exact value. Using such a measurement to explain multiple things implies the assumption that it is independent from everything else and exists without changing, which of course is not the case. Additionally, the other parameters that influence the range of values a measurement can take are often actively manipulated by the person conducting the experiment, because experiments are set up in such a way that the data can answer a certain, predetermined question. This means that the conditions under which a data point is collected would not be the same if the question were different. In effect, there is already a certain bias in the measurements which would be out of place if they were used to answer a different question.
Why can we not perform multiple tests on one dataset?
Focusing on whole data sets
One could also look at this from the data set point of view. A measurement never comes alone - it is placed into a certain context by the other data points that are collected during the same experiment. The combination of all these measurements is what allows us to either accept or reject the null hypothesis. Perhaps it is exactly this context that makes it impossible to draw other conclusions from the exact same data set.
Controls
Controls are an essential part of the experimental design. We always had a positive and a negative control where it made sense. Though we included the positive controls in the plots to show the viewer that nothing was faulty in our setup, we decided to remove them from our statistical tests. The reason for this is that we used an ANOVA for most of our analyses. This means that the means of the treatment groups are compared to each other, with the negative control as the reference category, and the result indicates whether at least one of the means differs. The problem with the positive control in such a setup is that it is intrinsically different from the negative control and would therefore strongly influence the outcome of the test. In effect, this would render the ANOVA uninformative and we would only have the post-hoc test (TukeyHSD) left for the evaluation of the data. Post-hoc tests, however, can only be used to formulate new hypotheses, so we would not be able to draw conclusions from our experiments. As scientists, this would have been an impossible sacrifice to make, so we removed the positive controls from the statistical tests.
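A minimal sketch of this step in R, assuming a data frame 'ros' with a factor 'treatment' (containing levels such as "neg_ctrl" and "pos_ctrl") and a numeric response 'total_fluorescence'; all of these names are placeholders.

```r
# Drop the positive control before testing; it stays in the plots only.
ros_test <- droplevels(subset(ros, treatment != "pos_ctrl"))

# Make the negative control the reference category of the ANOVA.
ros_test$treatment <- relevel(ros_test$treatment, ref = "neg_ctrl")

fit <- aov(total_fluorescence ~ treatment, data = ros_test)
summary(fit)
```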
Fitting Models
Fitting linear models comes with assumptions that need to be met:
- The expected value of the residuals is 0
- All residuals have the same variance
- All residuals are normally distributed
- The residuals are independent
To ensure that the models we fitted were appropriate for the data, we always checked the diagnostic plots to see whether all the modelling assumptions were met (Figures 4-7). Because the assumption of normality was not met in the ROS assays, we transformed the response variable of our model (total fluorescence) using the natural logarithm. However, for legibility, we used the untransformed data in the plots.
Fig. 4: Tukey-Anscombe plot
Fig. 5: QQ-plot
Fig. 6: Scale-location plot
Fig. 7: Leverage plot
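As a hedged sketch of this workflow in R (the data frame 'ros' and its columns are again placeholder names), the four diagnostic plots shown above can be produced directly from a fitted linear model, and the fit can be repeated on the log-transformed response if the assumptions are violated.

```r
# Fit the model on the raw response and inspect the diagnostic plots
# (Tukey-Anscombe, QQ, scale-location and leverage plot).
fit_raw <- lm(total_fluorescence ~ treatment, data = ros)
par(mfrow = c(2, 2))
plot(fit_raw)

# If, for example, the QQ-plot deviates clearly from the diagonal,
# refit the model on the log-transformed response.
fit_log <- lm(log(total_fluorescence) ~ treatment, data = ros)
plot(fit_log)
par(mfrow = c(1, 1))
```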
What do plots 4 and 5 tell us?
Fig. 4: Tukey-Anscombe plot
When one fits a linear model, one can imagine a line running through the measurements. This line should describe the trend as well as possible. The Tukey-Anscombe plot shows how big the differences between your measurements and this line (the residuals) are. We are looking for a regular scatter of points without any pattern. The plot can show whether assumptions 1, 2 and 4 are met.
Fig. 5: QQ-plot
The QQ-plot compares the quantiles of a normal distribution to the quantiles of the distribution of the residuals, i.e. the differences between your measurements and the fitted line described above. It checks whether assumption 3 holds. If your QQ-plot does not lie close to a diagonal line, the residuals are not normally distributed and a transformation might be necessary.
What do plots 6 and 7 tell us?
Fig. 6: Scale-location plot
If you see a trend here, it shows that the variance of the residuals changes with the fitted values. If you remember assumption 2, the variances should not change, so no trend would be ideal.
Fig. 7: Leverage plot
The leverage plot shows how much 'pull' a point has, i.e. how strongly it can influence the model. The leverage of a point is expected to be high when it lies far away from the mean of the explanatory variables. This plot (as well as the others) can be used to identify potential outliers.
Drawing Conclusions
We made sure to not only focus on and report p-values, but also F-statistics and R2 values, since they give valuable information. While R2 values show how well our explanatory variables explain the variability in the response variable, F-statistics show how large the variability between groups is compared to the variability within groups. A low F-statistic means that the variability between groups is not much larger than the variability within groups, which implies that the categories we tested are not that different from each other. Since we mainly performed ANOVAs, both of these values tell us how good our explanatory variables are at explaining the way the response variable fluctuates.
When the result of an ANOVA is significant, we cannot say which group is significantly different. This is why people often perform post-hoc tests. As mentioned above, post-hoc tests do not allow us to make statistical conclusions either, but they can be used to formulate new hypotheses. They can thus be an indication of the more detailed 'mechanisms' behind the significance of the ANOVA, but the results are to be handled with care. One of our advisors uses Dunnett's tests in the Prism software, but we found that Dunnett's tests compare all means to the reference category, which does not fit our purpose because we want to compare all groups to one another. We talked to Prof. Owen Petchey, the lecturer of our data analysis course, who suggested we use the TukeyHSD. It is a test that compares all means to one another and uses a family-wise error rate to correct for multiple testing. This means that our confidence level across all pairwise tests combined is set at 95%. Because this test ticks all our boxes, we decided to use it for our analyses.
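A minimal sketch of how these quantities and the post-hoc test can be obtained in R (data frame and column names are placeholders carried over from the examples above).

```r
# F-statistic, its p-value and R squared are reported by summary() of a
# fitted linear model.
fit_lm <- lm(log(total_fluorescence) ~ treatment, data = ros_test)
summary(fit_lm)

# The same model as an ANOVA, followed by Tukey's honest significant
# differences with a 95% family-wise confidence level.
fit_aov <- aov(log(total_fluorescence) ~ treatment, data = ros_test)
summary(fit_aov)
TukeyHSD(fit_aov, conf.level = 0.95)
```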
What is the multiple testing problem?
The multiple testing problem arises when one performs many pairwise tests, as in the TukeyHSD. Every time one tests a pair, there is a probability of 5% (a) of rejecting the null hypothesis even though it is true (Type I error). If one then performs n tests:
P(at least one false significant result) = 1 - P(no false significant result)
P(at least one false significant result) = 1 - (1 - a)^n
One can see that the probability of making at least one Type I error rises quickly as n increases, meaning that the confidence in the results shrinks. This is often corrected for by setting a smaller 'a' for every individual test, the larger n is.
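A short illustration in R of how quickly the family-wise error rate grows with the number of tests, and of a simple per-test correction of the kind described above (a generic illustration, not part of our analysis code).

```r
a <- 0.05       # per-test significance level
n <- 1:20       # number of pairwise tests

fwer <- 1 - (1 - a)^n          # probability of at least one false positive
round(fwer, 3)                 # already around 0.40 after 10 tests

a_bonferroni <- a / n          # smaller per-test level keeps the
round(a_bonferroni, 4)         # family-wise error rate at or below 0.05
```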
Communication of Results
It is easy to get lost in the heaps of numbers created during data analysis. It was important to us that we could always close the loop back to biological significance: looking at the p-values and R2 values, we asked ourselves what the result means for us and whether it makes sense at all. For the plots, we focused on adding as much information as possible while maintaining legibility. The aim was to give the viewer all the information needed to understand what our data show. For the error bars, we decided to use standard errors, since they provide the most easily interpretable representation of the uncertainty.

In some experiments, the setup was so big that we had to use several plates, which also meant that we had several controls. Standardising the data from different plates was not strictly necessary to ensure comparability, because all the conditions for counting the measurements as one experiment were met (plants of the same age, same time of treatment and measurement, etc.). Nevertheless, we did it in some cases to simplify comparison by eye. To do so, we standardised every plate by its respective negative control, which yields relative values.
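A hedged sketch of such a per-plate standardisation in R, assuming columns 'plate', 'treatment' and 'total_fluorescence' and assuming that standardisation means dividing by the mean of the plate's negative control (the exact procedure may differ between experiments).

```r
# Mean of the negative control per plate (column names are placeholders).
neg_means <- aggregate(total_fluorescence ~ plate,
                       data = subset(ros, treatment == "neg_ctrl"),
                       FUN  = mean)
names(neg_means)[2] <- "neg_mean"

# Express every measurement relative to its plate's negative control.
ros_std <- merge(ros, neg_means, by = "plate")
ros_std$relative_fluorescence <- ros_std$total_fluorescence / ros_std$neg_mean
```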
Facilitating Data Analysis
While we worked on our project, we came to realise how time-consuming and difficult a data analysis can be. Our Wet Lab Team performed the same assays repeatedly, and we believe that being consistent in the evaluation of the same type of experiment is of great importance for reporting congruous results. Therefore, we wrote code containing many functions that make it more efficient to evaluate a dataset with a specific structure. We are sure that we are not the only ones who are not well versed in data analysis at the start of iGEM, because universities often tend to focus on practical work.
When we were looking for ways to analyse Reactive Oxygen Species (ROS) burst assays, we could not find a lot of information. That is why we have made our code for analysing ROS data in RStudio available as a text file for everyone, together with a file in which every function is documented. For easier use, we also provide an example dataset (and an annotated version) with the correct structure for the code (you can download everything here). You can find our short and simple protocol for a ROS assay on our protocols page. This way, future iGEM teams that work on plant immunity can perform their ROS assays more easily and hopefully benefit from the work we did!
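To give an idea of what such a helper could look like (this is a hypothetical sketch, not the published functions themselves), the recurring steps for one assay type can be wrapped into a single reusable R function that expects the documented dataset structure.

```r
# Hypothetical wrapper, not our actual published code: the column names
# ('treatment', 'total_fluorescence') are assumptions for illustration.
analyse_ros <- function(file) {
  dat <- read.csv(file)
  dat$treatment <- factor(dat$treatment)

  # quick visual overview of the groups
  boxplot(total_fluorescence ~ treatment, data = dat,
          ylab = "Total fluorescence")

  # log-transformed ANOVA followed by the TukeyHSD post-hoc test
  fit <- aov(log(total_fluorescence) ~ treatment, data = dat)
  list(anova = summary(fit), tukey = TukeyHSD(fit))
}

# example usage (file name is a placeholder):
# results <- analyse_ros("ros_assay_example.csv")
```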