Analysing the Current iGEM Distribution
Introduction
One of the major challenges we were aware of from the outset was the extensive planning needed to get genetic parts from team A to team B. While the generous DNA synthesis offers from companies like IDT and Twist are incredibly helpful in setting up such an endeavor, it still involves a variety of logistical steps that are time-intensive and depend on long-distance shipping. Since many of the constructs built by iGEM teams draw on the same set of commonly used parts, a standardized part collection could address these issues while significantly reducing both overall cost and time.
One of the obvious candidates for such a collection is the iGEM Distribution, which is sent to all participating teams each year. Since its initial creation in 2006, the distribution has been an essential part of the iGEM experience and builds the foundation of projects from teams all over the world. As the community has grown, so has the distribution, which now contains over 2000 of the best parts in the competition. However, in a scientific discipline as rapidly evolving as synthetic biology, it is evident that after 15 successful years a new generation of distribution is needed.
This opinion is shared by the iGEM Foundation, which has announced its goal to create the second generation of the distribution: the world's best biotechnology toolkit, supporting both education and innovation. With such ambitious goals in mind, the iGEM Foundation turned to the community and asked this year's teams for their honest feedback on the current distribution and their wishes for the Distribution 2.0.
We recognized this unique opportunity and thoroughly investigated the current distribution. While doing so, we were not only interested in examining the distribution in the context of our project, but also in potential obstacles that could hinder the widespread adoption of such a collection.
Analysis
For this analysis, we reflected on our own decisions during the planning stage of our project and asked ourselves why we did not incorporate the 2021 distribution to a greater extent. We quickly realized that one of the main reasons is the relatively difficult and unintuitive way in which teams currently interact with the distribution online.
This can pose a major challenge, especially for new iGEM teams with no previous experience, and the same view was echoed by several other teams we met during this year's meetups. Encouraged by these discoveries, we took an in-depth look at the problem and identified two key issues:
- Missing or conflicting relevant information
- Lack of user-centered design
To get information about the parts contained on the 6 distribution plates, teams currently have to click through a series of pages to arrive at a plate-specific page that contains the relevant data. Although useful, the information presented there is rather limited and mostly helps to locate parts that have already been selected. Somewhat hidden on the same page is the link “Get a detailed Excel file for this plate”, which, despite its name, leads to the corresponding CSV file.
Curious about possible differences between the two versions, we compared them against each other and found that one in fact contains more parts than the other. This discrepancy originates from parts that are no longer included in the distribution due to copyright concerns. We immediately reported the finding to the registry. Although the CSV file contains more information than the HTML table, a closer look reveals that a large part of the information is still missing.
Of the 15 columns in total, 4 (Resistance, Gel Overall, Quantity, Seq Comment) contain no information, 2 (Sequencing, Well Status) contain just 1 unique value, and 1 (Plasmid) is a duplicate.
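The two checks described above (comparing the part lists of the HTML table and the CSV file, and profiling the CSV columns) can be sketched in a few lines of Python. This is only a minimal illustration, not the script we used: the sample data, the part names, and the function names `compare_part_lists` and `profile_columns` are made up for this example, and a real plate file may need extra cleaning.

```python
import csv
import io

# Hypothetical excerpt of a plate CSV. Names and values are illustrative only.
# Note the repeated "Plasmid" header, which is why columns are handled by index.
SAMPLE_CSV = """Part,Well,Plasmid,Plasmid,Resistance,Sequencing
BBa_X0001,A1,pSB1C3,pSB1C3,,Confirmed
BBa_X0002,A3,pSB1C3,pSB1C3,,Confirmed
BBa_X0003,A5,pSB1A2,pSB1A2,,Confirmed
"""


def compare_part_lists(html_parts, csv_parts):
    """Return the part names that appear in only one of the two sources."""
    html_set, csv_set = set(html_parts), set(csv_parts)
    return {
        "only_in_html": sorted(html_set - csv_set),
        "only_in_csv": sorted(csv_set - html_set),
    }


def profile_columns(csv_text):
    """Classify columns as empty, constant (one unique value) or duplicated."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    report = {"empty": [], "constant": [], "duplicate": []}
    seen = set()
    for index, name in enumerate(header):
        # Short rows are padded with empty strings instead of raising an error.
        values = tuple(
            row[index].strip() if index < len(row) else "" for row in body
        )
        distinct = set(values)
        if distinct <= {""}:
            report["empty"].append(name)
            continue
        if len(distinct) == 1:
            report["constant"].append(name)
        if values in seen:
            # Same content as an earlier column, regardless of its header.
            report["duplicate"].append(name)
        seen.add(values)
    return report


print(profile_columns(SAMPLE_CSV))
print(compare_part_lists(["BBa_X0001", "BBa_X0002"], ["BBa_X0002", "BBa_X0003"]))
```

Columns are compared by content rather than by header, so a duplicated column is detected even if it was exported under a different name.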
As some of this data can be essential for the use of these parts, we started to hunt for the missing information on the registry. Since the overview page also contains a so-called antibiotic file, we started our search there. However, our hopes were quickly dashed when we opened the file and found that the necessary information was not included there either. The same, unfortunately, applies to the other datasheets that are linked from the respective plate overviews.
After this temporary setback, we expanded our search to past distributions, since the 2021 distribution is identical to the 2019 one. While the 2019 datasheet does contain more data, it also led to another troubling discovery concerning the sequencing data. The registry generally sorts sequencing results into one of 7 categories:
- Confirmed
- Partially Confirmed - software is only able to partially confirm sequence, most likely due to one read being poor
- Long Part - length of sequence reads are insufficient to cover the middle of the part
- Inconsistent - part does not match its target sequence, may have a single bp mutation or not match at all
- Bad Sequence - usually caused by low DNA concentration or incorrect primers
- Single Error
- No Sequencing Information
Interested in how often the respective categories occur, we created a visualization tool that colors the position of each part on the 384-well plate according to its category.
[Figure: 384-well plate maps of the distribution, colored by sequencing category]
Considering that even a single point mutation can substantially change the behavior of a part, we decided to simplify the graph and reduce it to just 2 categories: confirmed and not confirmed.
[Figure: the same plate maps, colored as confirmed or not confirmed]
The resulting graph illustrates, in our opinion, a major problem of the current distribution. Given that teams may build their project on constructs that use one or more of these parts, it is of the utmost importance to ensure that they can rely upon the sequence given on the part page. Failure to do so can, in the worst case, lead to non-functioning constructs and a substantial time loss due to troubleshooting the associated problems.
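The plate layout and the two-way split described above can be sketched as follows. This is a simplified illustration rather than our actual visualization tool. It assumes that wells are labelled with a row letter followed by a column number (A1 to P24, the usual 16 x 24 layout of a 384-well plate), and the function names and example records are hypothetical.

```python
import string

ROW_LABELS = string.ascii_uppercase[:16]  # rows A..P
N_COLUMNS = 24                            # columns 1..24


def well_to_index(well):
    """Convert a label such as 'A1' or 'P24' to zero-based (row, column)."""
    well = well.strip().upper()
    row = ROW_LABELS.index(well[0])  # raises ValueError for rows beyond P
    column = int(well[1:]) - 1
    if not 0 <= column < N_COLUMNS:
        raise ValueError(f"column out of range in well {well!r}")
    return row, column


def is_confirmed(category):
    """Only the exact category 'Confirmed' counts; everything else does not."""
    return category.strip().lower() == "confirmed"


def plate_matrix(records):
    """Build a 16 x 24 grid of True/False flags; None marks an unused well."""
    grid = [[None] * N_COLUMNS for _ in ROW_LABELS]
    for well, category in records:
        row, column = well_to_index(well)
        grid[row][column] = is_confirmed(category)
    return grid


def confirmed_fraction(records):
    """Share of listed wells whose sequencing result is 'Confirmed'."""
    flags = [is_confirmed(category) for _, category in records]
    return sum(flags) / len(flags) if flags else 0.0


# Hypothetical records, not taken from the real datasheet.
example = [
    ("A1", "Confirmed"),
    ("A2", "Single Error"),
    ("B3", "Confirmed"),
    ("P24", "Long Part"),
]
grid = plate_matrix(example)
print(confirmed_fraction(example))
```

The resulting grid can be handed to any plotting library as a heat map. Treating "Partially Confirmed" as not confirmed is the strict reading of the split; a more lenient grouping would only require changing `is_confirmed`.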
Putting this issue aside, we once again concentrated on the part overview. As previously mentioned, we believe that the other major issue with the current distribution is a lack of user-centered design. In particular, we want to highlight the discrepancy between the information design as it currently exists and the workflows that teams want to use.
Most teams will first come into contact with the distribution page between the project ideation and planning stages. It is at this time that many of the decisions are made as to which parts will be used as building blocks for the project. Sadly, the distribution often misses this chance to influence the team's decisions. We believe this is because the information, as it is presented now, is difficult to incorporate into project planning.
One of the main reasons for this is the lack of important information such as part type, source organism, and cloning compatibility in both the HTML and the CSV overview. Using the distribution to its full extent currently requires downloading and analyzing all 6 CSV files, searching for and opening each of the >2000 part pages, and cross-referencing all of these resources. This is virtually impossible without the use of additional tools.
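The cross-referencing step described above amounts to a join between the plate rows and a table of part metadata. The sketch below shows the idea on toy data. All field names (`part`, `plate`, `well`, `type`, `organism`), the part names, and the helper functions are assumptions made for this illustration and do not reflect the registry's actual export format.

```python
METADATA_FIELDS = ("type", "organism")


def annotate_plate_rows(plate_rows, part_metadata):
    """Attach metadata to every plate row and report parts without metadata."""
    annotated, missing = [], []
    for row in plate_rows:
        meta = part_metadata.get(row["part"])
        if meta is None:
            missing.append(row["part"])
            meta = {}
        # Missing fields become None so that every row has the same keys.
        extra = {field: meta.get(field) for field in METADATA_FIELDS}
        annotated.append({**row, **extra})
    return annotated, missing


def select(annotated, **criteria):
    """Return the rows whose fields match all given criteria exactly."""
    return [
        row
        for row in annotated
        if all(row.get(key) == value for key, value in criteria.items())
    ]


# Hypothetical rows as they might be collected from the plate CSV files.
plate_rows = [
    {"part": "BBa_X0001", "plate": 1, "well": "A1"},
    {"part": "BBa_X0002", "plate": 1, "well": "A3"},
    {"part": "BBa_X0003", "plate": 2, "well": "B5"},
]
# Hypothetical metadata as it might be collected from the part pages.
part_metadata = {
    "BBa_X0001": {"type": "Terminator", "organism": "E. coli"},
    "BBa_X0002": {"type": "Promoter", "organism": "E. coli"},
}

annotated, missing = annotate_plate_rows(plate_rows, part_metadata)
print(select(annotated, type="Promoter"))
print(missing)
```

Keeping a separate list of parts without metadata makes gaps visible instead of silently dropping those rows.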
Adding to this, many users want an easy way to download the parts included in the distribution in standard formats such as GenBank or SBOL, something that is currently only possible with the help of third-party providers. The absence of plasmid maps further poses an unnecessary hurdle for many teams, something which should be addressed by future distributions.
To address some of these issues, we created an in silico version of the 2021 distribution that contains all parts, including sequence annotation, as a GenBank file that can be found on our GitHub.
Furthermore, we searched for all plasmids that are used in the distribution, downloaded the corresponding sequences, and reannotated them using bioinformatics tools. This was necessary because many plasmids had incomplete annotations or none at all. The resulting plasmids can also be downloaded from our GitHub.
By bundling all these efforts into one tool, we hope to show a potential path for the next generation of the distribution: a tool that can be flexibly adapted to the needs of each team and opens up valuable information to as many people as possible. To accomplish this, we examined which tools iGEM has already deployed successfully and came across Airtable, an application iGEM already uses for the so-called phoenix track. In their own words, Airtable is a low-code platform for building collaborative apps which allows its users to customize workflows, collaborate, and achieve ambitious outcomes. This fits our above-mentioned requirements perfectly.
To complete the tool, we additionally analyzed all sequences, identified the most commonly used restriction enzyme recognition sites, and included relevant antibiotic data. We proudly present the result:
[Embedded Airtable view of the 2021 distribution]
To allow the tool to be both updated and adapted to other contexts, we decided to make the data shown here freely available as a CSV file as well.
To conclude our analysis, we would like to draw attention to one additional issue that affects not only the distribution but the registry as a whole: so-called twin parts. The official definition reads as follows:
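The restriction site analysis mentioned above can be illustrated with a short scan for recognition sequences. This is a minimal sketch and not the pipeline we ran. The selection of enzymes is an assumption chosen for this example, the part names and sequences are invented, and only exact matches of unambiguous DNA letters are considered.

```python
from collections import Counter

# Recognition sequences (5'->3') of some enzymes common in cloning standards.
SITES = {
    "EcoRI": "GAATTC",
    "XbaI": "TCTAGA",
    "SpeI": "ACTAGT",
    "PstI": "CTGCAG",
    "NotI": "GCGGCCGC",
    "BsaI": "GGTCTC",
    "SapI": "GCTCTTC",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse complement of an upper-case DNA string."""
    return sequence.translate(_COMPLEMENT)[::-1]


def _find_all(sequence, pattern):
    """Yield every start index of pattern, including overlapping matches."""
    start = sequence.find(pattern)
    while start != -1:
        yield start
        start = sequence.find(pattern, start + 1)


def find_sites(sequence, sites=SITES):
    """Map enzyme name to sorted zero-based site positions on the given strand.

    Non-palindromic sites (for example BsaI) are also searched as their
    reverse complement so that sites on the opposite strand are reported.
    """
    sequence = sequence.upper()
    hits = {}
    for enzyme, site in sites.items():
        patterns = {site, reverse_complement(site)}
        positions = sorted(
            index for pattern in patterns for index in _find_all(sequence, pattern)
        )
        if positions:
            hits[enzyme] = positions
    return hits


def site_frequency(parts, sites=SITES):
    """Count in how many parts each enzyme has at least one site."""
    counts = Counter()
    for sequence in parts.values():
        counts.update(find_sites(sequence, sites).keys())
    return counts


# Invented sequences for illustration only.
parts = {
    "BBa_X0001": "aaGAATTCttGAGACCgg",
    "BBa_X0002": "ccGAATTCcc",
    "BBa_X0003": "acgtacgtacgt",
}
print(site_frequency(parts))
```

A part without any of these sites is compatible with the corresponding cloning methods, so the same scan can be turned into a compatibility column per part.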
Two or more parts are twins if they have the same DNA sequence
We first noticed these closely related parts while thoroughly analyzing the distribution sequences. Although such duplications seem harmless at first, they can pose unexpected challenges for future iGEM teams. During a deeper analysis of the iGEM registry database, we became aware of parts that have no fewer than 50 twins, making it extremely difficult to collect all the available data for those parts. However, the problems do not stop here.
A potential headache many teams face each year is the apparent absence of some of the most used parts from the distribution. This impression can be corrected, however, if we take a look at the twin parts. Here we discover that some of the most used parts are not included under their actual name, but under that of a twin part instead - a twin that often has far fewer recorded uses than the original part.
One such example is the double terminator BBa_B0014. Although it is one of the most used terminators, it is not directly included in the distribution. Instead, you can find the sequence-identical BBa_K823017 on plate 1 of the 2021 distribution, a part that has about 10 times fewer registered uses than the original one. Unfortunately, this is not a unique case, but one of several occurrences where the less frequently used twin is included instead of the most used part, which would be much easier to find.
On closer inspection, the distribution itself also contains several sequence-identical parts under different names. Here we list some of them:
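Finding twins and picking the best-known name for each group can be sketched as follows. This is a minimal illustration under the definition quoted above (identical DNA sequence). The part names, sequences, and use counts are invented for this example, and `group_twins` and `most_used` are hypothetical helpers rather than registry functions.

```python
from collections import defaultdict


def normalize(sequence):
    """Remove whitespace and ignore case so formatting does not hide twins."""
    return "".join(sequence.split()).upper()


def group_twins(parts):
    """Group part names by identical sequence; keep groups of two or more."""
    by_sequence = defaultdict(list)
    for name, sequence in parts.items():
        by_sequence[normalize(sequence)].append(name)
    return sorted(sorted(names) for names in by_sequence.values() if len(names) > 1)


def most_used(twins, uses):
    """Pick the twin with the most registered uses (ties broken by name)."""
    return max(twins, key=lambda name: (uses.get(name, 0), name))


# Invented data for illustration only.
parts = {
    "BBa_X0001": "ATGC ATGC",
    "BBa_X0002": "atgcatgc",
    "BBa_X0003": "GGGGCCCC",
    "BBa_X0004": "ATGCATGC",
}
uses = {"BBa_X0001": 120, "BBa_X0002": 12, "BBa_X0004": 3}

for twins in group_twins(parts):
    print(most_used(twins, uses), "<-", twins)
```

Listing each group under its most used member would let a team searching for a popular part find it even when a twin is what is physically on the plate.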
Dealing with these twin parts will undoubtedly be a major challenge for future implementations of the registry and distribution and will require some critical decisions on how to handle the already existing ones.
Discussion
With this analysis in hand, we contacted iGEM HQ and arranged a Zoom call with the director of the registry, Vinoo Selvarajah. In this meeting, we summarized our results in a short 30-minute presentation. The ensuing discussion gave us valuable insights into the complexity of the distribution, its inner workings, and the steps that are currently being taken to improve it.
We were excited to see how well our concerns and analyses were received and how open iGEM is to feedback. Moreover, we were very impressed by how seriously the wishes and opinions of the community are treated with regard to the new distribution.
Ultimately, we hope that our work this year will have a positive impact on the new distribution and enable future iGEM teams and scientists around the world to utilize this toolkit to its fullest extent.