Team:Fudan/Software

Updated on 2021-11-18: For the Search, please use the fix branch at https://github.com/FDUiGEM2021Software/Search and follow the README. The code was pushed to GitHub before the freeze, but unfortunately we neglected to update the drawing with the correct file names (now fixed).

On 2021-11-4: We used two new animations during our Judging Session: Search gp2 and TreeMap in a browser window.

# Highlights

  • A new search engine, Part Camera, designed specifically for iGEM registered parts. It not only surfaces more usable details during a search, but also shows both the direct and the potential links between parts in a straightforward way.

  • Carefully processed data collected directly from every part's page, backed by a reporting system that keeps Part Camera working. Our tool was tested with a tailored tester, "PartCamTester", and through practical use during our wet lab experiments.

  • Our results may help optimize the design of distribution kits. Almost every iGEM team with wet lab experiments can use Part Camera to find inspiration when brainstorming at the design stage, to check the parts they are uploading, and to make full use of registered parts and distribution kits.

  • We preserve the original data in .fasta, .xlsx, and .json formats, and every Part Camera function in Python files, so that future teams can build on our results for further development or testing. Part and sequence processing follows the SeqRecord standard of BioPython, allowing further processing and DIY modules.

  • Part Camera has a detailed user manual and an instructive, lightweight, user-friendly interface, so anyone can access it easily.

pic

Figure 1. Search results for 'GFP' in Part Camera

# The Problem We Want to Solve

The idea of developing a software tool focused on the relationships between parts originated from our experiment design. When our team started designing, we first focused on gp2, a protein that might satisfy the demands of our bio-circuit. But as our experiments went on, we found that gp2 was not as suitable as we had hoped. Let's call this challenge the "gp2 puzzle" for now. At that moment, we urgently needed to find a basic part that shares a similar function with gp2, a regulatory part previously tested with gp2, or an improved part based on gp2.

But after browsing all the community search tools and the official part tools, we failed to find one capable of this mission. Thus, we decided to build a software tool that not only shows relevant parts directly but also gives the details about a part that a wet lab experiment designer needs most.

# No Easy Way

  1. In experiment design

Our software actually helped us find a solution to the "gp2 puzzle". With the search function, we found BBa_K1893016 (pBAD + gp2). Using its existing induction data, we found suitable induction conditions and made gp2 work without time-consuming testing and without giving up gp2, which would have meant throwing away our design and ideas.

pic

Figure 2. Search results for 'gp2' in Part Camera

We found a breakthrough for our experimental barrier through pBAD :: gp2, the topmost search result.

  2. In part registration

When we are about to register our parts, we use Part Camera to double-check for duplication. We run each part through the BLAST module. Taking BBa_K3790086 as an example, our results show the overlaps with existing parts together with their BBa IDs, which CANNOT be found in the official iGEM BLAST results or in the part info on their pages. This implies that our results could be added to iGEM part information as a supplement.

pic
Figure 3. The direct output of Part Camera's BLAST for the input BBa_K3790086

We found many similar parts with IDs (marked by a red box) that are not shown by the official BLAST tools.

# Our Software: Part Camera

Part Camera is a search engine for the iGEM part repository, developed in Python. The keyword that most distinguishes our software tool from others is 'Links Between Parts'. We focus on analyzing and demonstrating the relationships between parts, giving users ideas on how a part can be used and how to create a new, efficient composite part with potentially linked parts, alongside a brief view of the basic information experiment designers want most.

Part Camera rests on two major pillars: the database containing information on all parts, and the functional modules ready to be called by an interface.

The database is built by a crawler using Selenium, a package for web testing, which lets us reach detailed information on registered parts, including the part's ID (e.g. BBa_K3790086), sequence, description, etc. It is worth mentioning that we also add some derived features that cannot be obtained directly from a part's wiki page, such as citations between parts, the number of times a part is used, and so on. These features also show the links between parts and the significance of a part. Meanwhile, we use the FP-growth algorithm to find potential links between two parts without direct links, based on their citations. Our database is saved in .fasta, .json, and .xlsx formats, making it easy both for programs to read and process and for users to check directly.

The functional modules embedded in our interface are mainly BLAST and TreeMap. The former is a commonly used algorithm for finding similar sequences, used here to enable Part Camera's sequence search. The latter is an interactive treemap generated by Pyecharts, which gives users an intuitive view of the relations between parts. All functions and data are integrated into our interface. With a server, everyone can access Part Camera through the internet, or you can get Part Camera from GitHub.

# Manual Book

For a quick start, you may read the user manual here! You can also find it in the project directory.

pic
pic

# Design

In this design section, to give a more straightforward picture of our whole program, we follow a search for BBa_K3606036 from a user's view and track the whole data flow, as an example of how a search is carried out.

# Set the Database

pic
Figure 4. The program flow chart of setting the database

Before a search can start, some preparation has to be done: setting up the database. Usually, users don't need to do this on their own computer, since 2021 iGEM Team Fudan has set up the database for you, and you can get it from GitHub or from the links at the end of this page! See more in the "Get our data here" section.

The file 'get_parts_data.py' is designed specifically for setting up the database. When it is run, its central component, the Selenium module, starts to work. Selenium is a suite of tools for automating web browsers, which allows us to get the original data for each part.

At the very beginning, we chose to use the Requests module for requesting data, but after consulting our mentor, we realized that Requests might burden iGEM's server, whereas Selenium, which serves as an automated web testing tool, can avoid putting much stress on iGEM's websites. Moreover, we also set timers to control the request frequency.
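The timers mentioned above can be sketched as a small throttle that enforces a minimum delay between page loads. This is only an illustration: the `Throttle` class, the `crawl` helper, and the `fetch` callable are hypothetical names and not the code in get_parts_data.py. In practice, `fetch` could wrap a Selenium `driver.get(url)` call.

```python
import time


class Throttle:
    """Enforce a minimum interval between calls (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between two requests
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Block until at least `min_interval` seconds have passed."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def crawl(urls, fetch, throttle):
    """Fetch every URL in order, pausing between requests."""
    pages = {}
    for url in urls:
        throttle.wait()
        pages[url] = fetch(url)
    return pages
```

Injecting the clock and sleep functions keeps the throttle easy to test without actually waiting.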

With the help of element paths, we can easily get the information we need. We started by getting the list of parts registered by each year's teams and then visited every part's wiki page for more details. After this procedure, we know the information for BBa_K3606036 shown in the following table. By repeating this step, we collected details for more than 40,000 parts. After saving them in Excel files, our original database is set up.

| Part properties | Example (BBa_K3606036) |
| --- | --- |
| Part number | BBa_K3606036 |
| Part name | MBP, maltose binding protein |
| Part url | http://parts.igem.org/Part:BBa_K3606036 |
| Short description | MBP, maltose binding protein |
| Part type | Coding |
| Team | Fudan |
| Registered year | 2020 |
| Part sequence | atgaaaatcgaagaaggtaaactggtaatctggat......(see full sequence in our database) |
| Part Page Contents | MBP, maltose binding protein MBP(maltose-binding protein). Sequence and Features |
| Release status | Not Released |
| Sample status | Sample Not in stock |
| Stars | 0 |
| Twins | BBa_K3606036 BBa_K2927003 |
| Assembling standards* | 1 1 0 1 1 |

Table 1. Raw data obtained by the crawler

Assembling standards*: the digits represent RFC[10], RFC[12], RFC[21], RFC[23], RFC[25], RFC[1000]. '1' stands for 'compatible', '0' stands for 'incompatible'.

# Process Our Data

To save search time and to unlock the potential of the original data, data processing is necessary. Our original data went through the following steps to become more valuable, exposing both the strong and the weak relationships between parts.

pic
Figure 5. The program flow chart of processing data

  1. Break down the full contents of the part page using the Participle Method (word segmentation) and find the parts cited or mentioned on its wiki, which represent the weak association.
  2. Build the web of weak associations based on the cite-cited relationship, and use lists to preserve every node's result for later use.
  3. Judge whether the part is included in distribution kits, and extend the links between parts to links between parts and distribution kits.
  4. Build the web of strong associations based on the use-used relationship, then set the nodes and preserve them as in step 3.
  5. Standardize the format of every record so that our database is directly readable for users, while minimizing the chance of errors in later calls.
  6. Create backups in different formats (.xlsx, .fasta, .json). The three formats have different advantages, ranging from legibility to processability, so that later teams can find the most suitable interface, increasing the scalability and generalizability of our data.

| Part properties | Example (BBa_K3606036) |
| --- | --- |
| Part number | BBa_K3606036 |
| Part name | MBP, maltose binding protein |
| Part url | http://parts.igem.org/Part:BBa_K3606036 |
| Short description | MBP, maltose binding protein |
| Part type | Coding |
| Team | Fudan |
| Registered year | 2020 |
| Part sequence | atgaaaatcgaagaaggtaaactggtaatctggat......(see full sequence in our database) |
| Part Page Contents | MBP, maltose binding protein MBP(maltose-binding protein). Sequence and Features |
| Release status | Not Released |
| Sample status | Sample Not in stock |
| Stars | 0 |
| Twins | BBa_K3606036 BBa_K2927003 |
| Assembling standards* | 1 1 0 1 1 |
| Part used | BBa_K3606040 BBa_K3606041 BBa_K3606038 BBa_K3606037 BBa_K3606039 |
| Using parts | self |
| Length | 1101 |
| Cits | None |
| Whether in distribution kits | 0 |
| Cited by | None |
| Cited times | 0 |
| Scores** | 1 |

Table 2. Our data format after processing

Assembling standards*: the digits represent RFC[10], RFC[12], RFC[21], RFC[23], RFC[25], RFC[1000]. '1' stands for 'compatible', '0' stands for 'incompatible'.

Scores**: 'Scores' represents the significance of a part. It is calculated by the following formula: Scores = 15*distribution_kits + 3*used_times + 1*cited_times
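The weak-association steps and the score formula can be sketched together as follows. This is a minimal illustration and not necessarily how Part Camera implements them: the regular expression for part IDs, the function names, and the input layout (a dictionary from part ID to page text) are assumptions made for the example.

```python
import re
from collections import defaultdict

# Assumed pattern for registry IDs such as BBa_K3606036.
PART_ID = re.compile(r"BBa_[A-Za-z0-9]+")


def find_cited_parts(part_id, page_text):
    """Return the part IDs mentioned on a page, excluding the part itself."""
    found = set(PART_ID.findall(page_text))
    found.discard(part_id)
    return sorted(found)


def build_citation_web(pages):
    """Map each part to the parts it cites and to the parts citing it."""
    cites = {pid: find_cited_parts(pid, text) for pid, text in pages.items()}
    cited_by = defaultdict(list)
    for pid, targets in cites.items():
        for target in targets:
            cited_by[target].append(pid)
    return cites, dict(cited_by)


def score(in_distribution_kit, used_times, cited_times):
    """Scores = 15*distribution_kits + 3*used_times + 1*cited_times."""
    return 15 * int(in_distribution_kit) + 3 * used_times + 1 * cited_times
```

The `cited_by` dictionary gives the "Cited by" field directly, and its list lengths give "Cited times" for the score.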

# Fp-growth

FP-growth (frequent pattern growth) is an algorithm designed for data mining and machine learning. It is an effective way to find frequent itemsets without candidate itemset generation. In this case, we use FP-growth to find the parts most frequently used in pairs. FP-growth helps expose the deeper usage patterns of parts and may help users decide how to use a target part. Due to limited computing power, we have only written the module; more computing power is needed for practical use.
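To make the goal concrete, the sketch below counts how often two parts appear together in the same composite part and keeps the pairs that reach a minimum support. It deliberately uses brute-force pair counting rather than FP-growth: for itemsets of size two, both give the same frequent pairs, but this version does not scale the way an FP-tree does. The function name, the example transactions, and the `min_support` value are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return {(part_a, part_b): count} for pairs in >= min_support transactions."""
    counts = Counter()
    for parts in transactions:
        # Deduplicate and sort so each unordered pair is counted once per transaction.
        for pair in combinations(sorted(set(parts)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Each transaction lists the sub-parts of one (hypothetical) composite part.
composites = [
    ["promoter_A", "rbs_B", "gfp"],
    ["promoter_A", "rbs_B", "rfp"],
    ["promoter_A", "gfp"],
]
pairs = frequent_pairs(composites, min_support=2)
```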

# Searching for Target and Do BLAST

Now, finally, the user can use our search engine! By entering "BBa_K3606036", the user sends a search request to our program. We accept fuzzy search not only for a part's ID, but also for short keywords and sequences! This is done by the Participle Method, and the results are sorted by relevance, so you will usually find the most relevant part at the top of the results.
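One simple way to combine fuzzy ID matching with keyword matching is sketched below. It is a stand-in built on assumptions, not the Participle Method itself: it scores each record with `difflib.SequenceMatcher` from the Python standard library for the ID and with plain token overlap for the description, then sorts by that score. The record fields and the scoring rule are hypothetical.

```python
from difflib import SequenceMatcher


def relevance(query, record):
    """Score one record against the query; higher means more relevant."""
    q = query.lower()
    # Fuzzy similarity between the query and the part ID (0.0 to 1.0).
    id_score = SequenceMatcher(None, q, record["id"].lower()).ratio()
    # Fraction of query tokens that occur in the description.
    tokens = q.split()
    words = set(record["description"].lower().replace(",", " ").split())
    keyword_score = sum(t in words for t in tokens) / len(tokens) if tokens else 0.0
    return max(id_score, keyword_score)


def search(query, records, limit=10):
    """Return up to `limit` records sorted by descending relevance."""
    ranked = sorted(records, key=lambda r: relevance(query, r), reverse=True)
    return ranked[:limit]
```

Taking the maximum of the two scores lets a query match either as a slightly mistyped ID or as a keyword, without one signal diluting the other.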

If you fail to find your answer, or there are no relevant parts in the iGEM part database, congratulations! You may register this part as a new part in the iGEM part registry. But you may also use our BLAST tool for a deeper, sequence-based search for similar parts. If you want to check your BLAST history, you can find it in the project directory, so you don't need to spend time running the same BLAST again.
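Reusing earlier BLAST results can be sketched as a small on-disk cache keyed by a hash of the query sequence. The file name `blast_history.json`, the `run_blast` callable, and the result layout are hypothetical; the actual history files in the project directory may be organized differently.

```python
import hashlib
import json
from pathlib import Path


def cached_blast(sequence, run_blast, history_path="blast_history.json"):
    """Return BLAST hits for `sequence`, reusing a stored result when one exists."""
    path = Path(history_path)
    history = json.loads(path.read_text()) if path.exists() else {}
    # Normalize case and whitespace so the same sequence maps to the same key.
    key = hashlib.sha256(sequence.strip().lower().encode()).hexdigest()
    if key not in history:
        history[key] = run_blast(sequence)  # the slow step, done once per sequence
        path.write_text(json.dumps(history, indent=2))
    return history[key]
```

The result returned by `run_blast` must be JSON-serializable (for example, a list of dictionaries) for this sketch to work.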

# Browse Your Information in a Concise View

How information is presented matters for software: it largely determines how efficiently the data can be used. We took our inspiration from Google, the biggest search engine in the world. We believe Google's presentation is the most thoroughly tested and presents data well for quick scanning. Thus, we imitated Google's presentation style and wrote a Google-like web page. When searching for BBa_K3606036, the web page calls a backend Python file to return a list of matching parts, and then looks up the details based on this list. We also give a quick view of our core feature: relationships. At the top of the part blocks, we add short tags or parts. These tags may not have obvious connections with the searched part, but they have potential links or similar usage (for example, RFP may be one tag when searching for GFP), which are obtained by the Keyword Algorithm. You may click on them to learn more about these related parts and find inspiration while browsing.

pic
Figure 6. A search block marked with corresponding info.

# TreeMap

TreeMap is the overall representation of the relationships between parts. A clickable treemap, built by Pyecharts from our data, fully exposes the relation map of the target part, so that users can grasp all the relationships at a glance. You may continue your exploration or brainstorming based on this map.
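The nested data behind such a treemap can be sketched as below. The helper builds the `name`/`value`/`children` dictionaries that ECharts-style treemaps take as input. The relation names and part IDs reuse fields from Table 2, but the function name, the grouping by relation type, and the constant value of 1 per leaf are assumptions.

```python
def build_treemap_data(part_id, relations):
    """Build ECharts-style treemap nodes for one part.

    `relations` maps a relation name (e.g. "Part used") to a list of part IDs.
    """
    groups = []
    for relation, parts in relations.items():
        children = [{"name": pid, "value": 1} for pid in parts]
        if children:  # skip relation types with no linked parts
            groups.append(
                {"name": relation, "value": len(children), "children": children}
            )
    total = sum(group["value"] for group in groups)
    return [{"name": part_id, "value": total, "children": groups}]


data = build_treemap_data(
    "BBa_K3606036",
    {
        "Part used": ["BBa_K3606037", "BBa_K3606038"],
        "Twins": ["BBa_K2927003"],
        "Cited by": [],
    },
)
# With Pyecharts installed, `data` could be passed to TreeMap().add("relations", data).
```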

TreeMap

Figure 7. Clickable treemap of BBa_K3606036. This gif shows how our treemap works.

# Features and Future

  1. FP-tree data may help optimize distribution kits.

As introduced before, FP-growth exposes the parts most frequently used in pairs. The design of distribution kits may draw on this result, for example by linking frequently used pairs in one plasmid to save plate wells, or by adding new usable twins to distribution kits, making the kits more useful, more flexible, and less expensive to manufacture.

  2. SeqRecord formats and multi-format databases enable extension to other bioinformatics modules, making easy DIY functions possible.

In our database, part data are stored in three formats, including .fasta, a format commonly used in sequence processing, so other bioinformatics modules can easily access the database. Meanwhile, in our data stream, every part is converted into a SeqRecord, a universal format in BioPython. So you may add your desired functions and modules with a few lines of code and integrate them into PartCam.
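As an illustration of how little code such an extension needs, the sketch below writes part records to FASTA text, reads them back, and adds one small DIY function, using only the Python standard library. With BioPython installed, `Bio.SeqIO` would normally handle the reading and writing; the helper names and the header layout (`>ID description`) are assumptions, not necessarily the layout of the database files.

```python
import textwrap


def to_fasta(records, width=60):
    """Format (part_id, description, sequence) tuples as FASTA text."""
    blocks = []
    for part_id, description, sequence in records:
        header = f">{part_id} {description}".rstrip()
        body = "\n".join(textwrap.wrap(sequence, width))
        blocks.append(f"{header}\n{body}")
    return "\n".join(blocks) + "\n"


def parse_fasta(text):
    """Parse FASTA text into a {part_id: sequence} dictionary."""
    sequences = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            current = line[1:].split()[0]  # the ID is the first header token
            sequences[current] = ""
        elif line and current is not None:
            sequences[current] += line
    return sequences


def gc_content(sequence):
    """Fraction of G and C bases: an example of a small DIY function."""
    if not sequence:
        return 0.0
    sequence = sequence.lower()
    return (sequence.count("g") + sequence.count("c")) / len(sequence)
```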

  3. Better search engine, better ideas

Inspiring ideas through links is what PartCam cares about most. Just as the 2021 Fudan iGEM team got its ideas, we believe that more and more teams will draw inspiration from PartCam and the links between parts it reveals, and turn their parts into great, meaningful projects.

# Get Part Camera

https://github.com/FDUiGEM2021Software

# One-line Highlights

  • Part Camera, designed for iGEM registered parts, displays relations between parts.
    TreeMap
  • Carefully processed part data, validated during our wet lab experiments.
  • Will help teams better use distribution kits.
  • Future teams could use our code and data (both available at https://github.com/FDUiGEM2021Software) for further development.
  • A detailed user manual for Part Camera and an instructive interaction interface, super user-friendly. And, once installed, you do NOT need Internet for further usage.

# Acknowledgements

BioPython

selenium

pyecharts
docsify