Line 9: | Line 9: | ||
<title></title> | <title></title> | ||
<!-- 4444444444444444444444此处上传时需要删除44444444444444444 --> | <!-- 4444444444444444444444此处上传时需要删除44444444444444444 --> | ||
− | + | ||
− | + | ||
− | + | ||
<!-- 4444444444444444444444此处上传时需要删除44444444444444444 --> | <!-- 4444444444444444444444此处上传时需要删除44444444444444444 --> | ||
Line 21: | Line 19: | ||
#content { padding:0px; width:100%; margin-top:-7px; margin-left:0px; border:none;} | #content { padding:0px; width:100%; margin-top:-7px; margin-left:0px; border:none;} | ||
body, html {background-color:#ececea; width: 100%; height: 100%;scroll-behavior: smooth;} | body, html {background-color:#ececea; width: 100%; height: 100%;scroll-behavior: smooth;} | ||
− | + | ||
− | + | /* 666666666666666666666666666666666666666666666666666666666666以下为加载动画样式66666666666666666666666666666666666666666 */ | |
+ | #gif | ||
+ | { | ||
+ | position: fixed; | ||
+ | top: 0; | ||
+ | left: 0; | ||
+ | width: 100%; | ||
+ | height: 100%; | ||
+ | z-index: 99999; | ||
+ | /* display: -webkit-box; */ | ||
+ | -webkit-box-orient: horizontal; | ||
+ | -webkit-box-pack: center; | ||
+ | -webkit-box-align: center; | ||
+ | background: #fff; | ||
} | } | ||
+ | #gif>img/*设置gif下的img元素样式*/ | ||
+ | { | ||
+ | |||
+ | /* width:40%; */ | ||
+ | } | ||
+ | .gif2{ | ||
+ | top: 25%; | ||
+ | height:auto; | ||
+ | margin-left:30%; | ||
+ | width:40%; | ||
+ | } | ||
+ | .gif1{ | ||
+ | margin-left:35%; | ||
+ | top: 70%; | ||
+ | width:30%; | ||
+ | margin-top:-25px; | ||
+ | |||
+ | } | ||
+ | |||
+ | /* 666666666666666666666666666666666666666666666666666666666666以上为加载动画样式66666666666666666666666666666666666666666 */ | ||
/* 左侧导航栏样式 */ | /* 左侧导航栏样式 */ | ||
#bodyContent h2{ | #bodyContent h2{ | ||
Line 63: | Line 94: | ||
height:100%; | height:100%; | ||
background-color:rgba(0,0,0,0); | background-color:rgba(0,0,0,0); | ||
− | background:url("https://static.igem.org/mediawiki/2021/ | + | background:url("https://static.igem.org/mediawiki/2021/2/2f/T--SCUT-China--description-0.png") ; |
background-size: 100% 100%; | background-size: 100% 100%; | ||
background-attachment: fixed; | background-attachment: fixed; | ||
Line 138: | Line 169: | ||
} | } | ||
− | + | /* #nav li a:hover > img{ | |
+ | visibility:visible; | ||
+ | } */ | ||
#nav li.on{ | #nav li.on{ | ||
background:#dad9d5; | background:#dad9d5; | ||
− | + | } | |
#right-nav li a img{ | #right-nav li a img{ | ||
− | + | width: 30px; | |
− | + | height: auto; | |
− | + | float: right; | |
− | + | margin-right: 10px; | |
} | } | ||
#right-nav { | #right-nav { | ||
Line 157: | Line 190: | ||
} | } | ||
#right-nav li { | #right-nav li { | ||
− | + | width: 167px; | |
height: 32px; | height: 32px; | ||
line-height: 32px; | line-height: 32px; | ||
Line 531: | Line 564: | ||
<body > | <body > | ||
+ | |||
<!-- 顶部导航栏位置 --> | <!-- 顶部导航栏位置 --> | ||
<div class="header" > | <div class="header" > | ||
Line 670: | Line 704: | ||
</div> | </div> | ||
<!-- 444444444444444444444444444444444444444444444444444444444444444以下需要替换(直接替换即可)预览时字体效果不会出现44444444444444444444444444444444444444444444444444444444444444 --> | <!-- 444444444444444444444444444444444444444444444444444444444444444以下需要替换(直接替换即可)预览时字体效果不会出现44444444444444444444444444444444444444444444444444444444444444 --> | ||
+ | <!-- -0000000000000000000000000000000000000加载动图000000000000000000000000000000000000000 --> | ||
+ | <div id= "gif"> | ||
+ | <p style="font-size:30px;margin-left:43%;margin-top:20px;">Loading...</p> | ||
+ | <img class="gif2"src = "https://static.igem.org/mediawiki/2021/b/b8/T--SCUT-China--model-gif-bool.gif"/> | ||
+ | <img class="gif1"src = "https://static.igem.org/mediawiki/2021/1/12/T--SCUT-China--model-gif-bar.gif"/> | ||
+ | </div> | ||
+ | <!-- -0000000000000000000000000000000000000以上加载动图000000000000000000000000000000000000000 --> | ||
+ | |||
<!-- 111111111111111111111111111111111111111111111111左侧导航条,标准样式11111111111111111111111111111111111 --> | <!-- 111111111111111111111111111111111111111111111111左侧导航条,标准样式11111111111111111111111111111111111 --> | ||
− | + | <main class="main"> | |
− | + | <div class="main_div" > | |
− | + | <div class="left" > | |
− | + | <div class="siderBar"> | |
− | + | <div class="menu"> | |
− | + | <ul id="nav"> | |
− | + | <li style="font-size:26px;"><a class="a_nav">GA-HMM</a></li> | |
− | + | <li ><a>Background</a></li> | |
− | + | <li><a >Fundamental</a></li> | |
− | + | <li><a >Strategy</a></li> | |
− | + | <li><a > Result Analysis</a></li> | |
− | + | <li><a >Availability</a></li> | |
− | + | <li><a >References</a></li> | |
− | < | + | </ul> |
− | + | </div> | |
+ | </div> | ||
+ | </div> | ||
+ | |||
+ | <div class="right" style="width: 75%;float: right;"> | ||
+ | <div class="content-my"> | ||
+ | <div id="wrap"> | ||
+ | <hr class="hrmar" style="visibility:hidden;" id="content_part_h2"> | ||
+ | <hr class="hrmar"style="visibility:hidden;" id="content_part_h2"> | ||
+ | <div id="hrmar"> | ||
+ | <div id="content_part"> | ||
+ | <h2 class="content_text_h2">Background</h2> | ||
+ | <div class="content_div_text"> | ||
+ | <p id="p1_1"align="left"> Most eukaryotic genomic DNA is wrapped in nucleosomes, which occlude and <span>strongly distort</span> the wrapped DNA. Transcription factor is a special protein that binds to a specific region of the gene to start transcription downstream. However, in the chromosomal state, the genomic DNA is tightly bound to the nucleosome. It can not reveal the <span>transcribed factor binding sites</span> (TFBS), the upstream activating sequence (UAS) for the promoter. Only when the nucleosome is <span>relaxed or not tightly bound</span> could TF have more opportunity to bind UAS and function effectively. </p> | ||
+ | <p id="p1_2"align="left"> The affinity of nucleosomes to genomic DNA is determined by the <span>arrangement and combination</span> of sequence bases, so the affinity will be different if the sequence is slightly altered. Therefore, we hope to decrease the histone binding affinity score by introducing mutations to the sequence bases except for UAS and the core region. For this reason, a tool that can calculate the nucleosome affinity as an <span>evaluation function</span> to score the effect of the introduction of base fine-tuning is <span>in urgent need.</span> </p> | ||
+ | <p id="p1_3"align="left"> Suppose there is a scoring tool that could calculate histone binding affinity theoretically. In that case, we could calculate nucleosome affinity for all the possible arrangements of bases in a <span>fixed length</span>. The promoter sequence in theory with the lowest nucleosome affinity sequence and highest strength could be <span>selected</span>. Of course, this approach has extremely high time complexity and low specificity; that's the reason we choose to combine genetic algorithm to optimize the promoter sequence iteratively. </p> | ||
+ | </div> | ||
</div> | </div> | ||
− | + | </div> | |
− | + | ||
− | + | <hr class="hrmar"style="visibility:hidden;" id="content_part_h2"> | |
− | + | <div id="hrmar"> | |
− | + | <div id="content_part"> | |
− | + | <h2 class="content_text_h2">Fundamental</h2> | |
− | + | <div class="content_div_text"> | |
− | + | <p id="p2_1"align="left"> The hidden Markov model (HMM) has been known for decades<sup>[1]</sup>.<br> Given the characteristic of DNA sequence, and the accumulated pieces of evidence that DNA sequence itself can highly predict the localization of nucleosome in vivo, also the function of chromatin is closely related to the localization of nucleosome, scientists proposed using a <span>duration Hidden Markov Model</span> (dHMM) to predict the occupation of the corresponding nucleosomes in chromosomal DNA: divided into the <span>nucleosome (N) state </span>and<span> the linker (L) state</span>, where the length of the nucleosome state is fixed at 147 bp and the linker state is variable<sup>[2]</sup>. It is assumed that the two states must be alternately connected. A complete chromatin sequence must begin and end in the linker state, training a 4th order time-dependent Markov chain for the N state and a homogeneous 4th order Markov chain for the L state.<br> In the detailed method, they let <span>e</span> = e<sub>1</sub>, ..., e<sub>147</sub> be corresponding to the nucleosome DNA sequence, and let P<sub>N</sub> be the probability of observing <span>e</span> as a nucleosome, computed as the product of probabilities for both Watson and Crick strands under the 4th order Markov Chain model. They assume that the linker DNA length of a given species has an unknown distribution F<sub>L</sub> (k) defined for k = 1, ..., τ<sub>L</sub> (the maximum linker length allowed). An observed linker DNA sequence <span>e</span> = e<sub>1</sub>, ..., e<sub>k</sub> carries two pieces of information, the length is k bp, and given which, the emitted letters are e<sub>1</sub>, ..., e<sub>k</sub> . Let G<sub>L</sub> (<span>e</span>|k) denote the homogeneous Markov chain model for the linker DNA (again including both strands). Then observing <span>e</span> as a linker DNA has probability | |
− | <div | + | </p> |
− | + | <div class="content_div_img"> | |
− | + | <img style="width:50%;" src="img/model/HMM_function1.png"> | |
− | + | </div> | |
− | + | <p id="p2_2"align="left"> Suppose <span>x</span> = x<sub>1</sub>, ..., x<sub>n</sub> is a genomic DNA sequence of length n, where x<sub>i</sub> = A/C/G/T. Let <span>z</span> = z<sub>1</sub>, ..., z<sub>n</sub> be the corresponding hidden state path, where z<sub>i</sub> = 1 if x<sub>i</sub> is covered by a nucleosome state, and 0 otherwise. Suppose that the path <span>z</span> = z<sub>1</sub>, ..., z<sub>n</sub> partitions <span>x</span> into k consecutive nucleosome or linker state blocks, in which the nucleosome blocks have a uniform length of 147 bp, whereas the length of linker blocks may vary. They denote these blocks as <span>y</span> = <span>y+</span>, ..., <span>y<sub>B</sub> </span>, and their state identification as <span>s</span> = s<sub>1</sub>, ..., s<sub>B</sub> , where s<sub>i</sub> = 1 if <span>y<sub>i</sub></span> is nucleosome state, and 0 otherwise. The probability of observing (<span>x</span>, <span>z</span>) is given by </p> | |
− | + | <div class="content_div_img"> | |
− | + | <img style="width:90%;"src="img/model/HMM_function2.png"> | |
− | + | </div> | |
+ | <p id="p2_3"align="left"> where π<sub>0</sub>(s<sub>1</sub>) and π<sub>e</sub>(s<sub>B</sub> ) stand for the probabilities that the chain initializes and ends with state s<sub>1</sub> and s<sub>k</sub> respectively, and I is an indicator function. Since they assume that a complete chromatin sequence must start with and end in a linker state, π<sub>0</sub>(s<sub>1</sub> = 0) = π<sub>e</sub> (s<sub>B</sub> = 0) = 1. Define the nucleosome occupancy at a specific position i, denoted o i , as the posterior probability that z<sub>i</sub> = 1, i.e., </p> | ||
+ | <div class="content_div_img"> | ||
+ | <img style="width:50%;"src="img/model/HMM_function3.png"> | ||
+ | </div> | ||
+ | <p id="p2_4"align="left"> Most importantly, the histone binding affinity score at position i is defined as the log likelihood ratio for the region x<sub>i-73</sub>, ... x<sub>i</sub> , ..., x<sub>i+73</sub> to be a nucleosome vs. a linker, i.e., </p> | ||
+ | <div class="content_div_img"> | ||
+ | <img style="width:80%;"src="img/model/HMM_function4.png"> | ||
+ | </div> | ||
+ | <p id="p2_5"align="left"> Given the models P<sub>N</sub>, G<sub>L</sub> and F<sub>L</sub>, the optimal path <span>z</span> can be found by the standard Viterbi algorithm, and the nucleosome occupancy score can be estimated using forward and backward algorithms<sup>[2]</sup>. </p> | ||
+ | <p id="p2_6"align="left"> The model <span>has been trained</span> on a large number of yeast nucleosome DNA fragments dataset that have been pyrosequencing and on a species by species basis, and the test results are satisfactory<sup>[2]</sup>. They made the <span>nucleosome occupancy predicting tool</span> into a software tool called <span>NuPoP</span>, which means "nucleosome position predicting", as an R language package for researchers to use. </p> | ||
+ | </div> | ||
+ | </div> | ||
+ | </div> | ||
− | + | <hr class="hrmar"style="visibility:hidden;" id="content_part_h2"> | |
− | + | <div id="hrmar"> | |
− | + | <div id="content_part"> | |
− | + | <h2 class="content_text_h2">Strategy</h2> | |
− | + | <h4 class="content_text_h2" style="font-size:18px;margin-top:20px;"> Optimizing design</h4> | |
− | + | <div class="content_div_text"> | |
− | + | <p id="p3_1"align="left"> We designed to obtain the variants with different nucleosome affinity of the promoter sequence by introducing the limited amount of random bases <span>mutation</span> without destroying the UAS sequence and the core region of the promoter. | |
− | + | </p> | |
− | + | <p id="p3_2"align="left"> Actually, a <span>similar</span> iterative optimization model has been proposed, called <span>NucOpt Progs</span>. The theory is technically known as a “<span>Greedy algorithm</span>”, in which all possible random single mutations are <span>enumerated</span> on the initial promoter sequence, and every single sequence variant was calculated the nucleosome affinity by NuPoP to obtain a score. After that, the sequence with the lowest affinity was selected as the winner in this round, then taken as the parent placing into the next round, calculating all mutation possible cases enumeration on this winner and vote in a new champion variant candidate. Finally, the theoretical best optimal sequence which is difficult to gain higher score is obtained<sup>[3]</sup>. </p> | |
− | + | <p id="p3_3"align="left"> However, after learning and testing NucOpt Progs, we found that this method has many <span>disadvantages</span>. First, the exhaustive enumeration method brings<span> immense time complexity</span>. It takes 36 hours to optimize a sequence of 800 bp length (based on the computer capacity). Secondly, single mutation only and greedy algorithm in use obviously determine that it's extraordinarily easy to get optimization fallen into a <span>local optimal solution</span>. For example, assuming that a single base change is a component of a group of mutations that significantly decrease histone affinity, but rarely calculating the benefit exactly this single base change could bring is unremarkable. In this case, the superior choice would easily be neglected. Thirdly, taking into account the biological effect, other published studies have shown that rarely low affinity is <span>not directly related</span> to the promoter's strength<sup>[3]</sup>. At the same time, the researchers speculate, the site distribution of nucleosome affinity may play a vital role in the promoter strength. </p> | |
− | + | <p id="p3_4"align="left"> Given the above three defects, we have designed a software <span>NuPGO</span> which can make up for these shortcomings and achieve a <span>better optimization effect</span>. Our new strategy is to combine NuPoP with the genetic algorithm(Fig.1) using the <span>roulette wheel operator</span> so that the base number of mutations can realize <span>adaptive</span> convergence according to probability instead of only introducing a single random mutation at a time, which not only effectively reduce the time complexity but also take much more possibilities into account, escaping local optimal solution. Considering the biological effect, we additionally set the <span>user-defined region weights</span> especially, which enable the algorithm to place larger or smaller weights on the specific sequence region. </p> | |
− | + | <div class="content_div_img"> | |
− | + | <img style="width:85%;"src="img/model/Fig1_genetic_algorithm.png"> | |
− | + | </div> | |
− | + | <p style="font-size:16px;text-align:center;">Figure 1. Genetic algorithm schematic diagram</p> | |
− | + | </div> | |
− | + | </div> | |
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
+ | <div id="content_part"> | ||
+ | <h4 class="content_text_h2" style="font-size:18px;margin-top:20px;"> Greedy Algorithm vs Roulette Operator</h4> | ||
+ | <div class="content_div_text"> | ||
+ | <p id="p3_5"align="left"> Greedy algorithm and roulette wheel selection algorithm are both essentially <span>evolutionary algorithms</span>. The greedy algorithm also performs better than other evolutionary algorithms in some cases. Still, its efficiency advantage brought by enumeration couldn't display here, while the <span>multi-factor considering</span> ability in wheel selection algorithm stands out. | ||
+ | </p> | ||
+ | </div> | ||
+ | <h4 class="content_text_h2" style="font-size:18px;margin-top:20px;"> Greedy Algorithm</h4> | ||
+ | <div class="content_div_text"> | ||
+ | <p id="p3_6"align="left"> The fatal disadvantage of the previous promoter optimizing algorithm is that only the<span> best performer</span> is selected in each round. Perhaps the author aimed to reduce the time complexity of exhaustive enumeration and simply pursue the best score. But such algorithms can easily trap the results into <span>local optima</span>. For example, the three-point mutation on site ABC decrease the histone affinity more than the three-point mutation on site DEF. However, the affinity decrease of site B mutation is smaller than that of site D mutation, so the greedy algorithm will choose site D mutation strategy as the winner, while site B strategy is obviously better actually. </p> | ||
+ | <p id="p3_7"align="left"> Moreover, only a random single mutation is introduced in each round. In other words, the mutation number is set to 1, which leads to gigantic amounts of <span>enumerations</span>, and the time complexity has not been sufficiently reduced. The execution of the greedy algorithm is time-consuming. In our test, an 800bp length promoter had to cost 36 hours to run. </p> | ||
+ | <div class="content_div_img"> | ||
+ | <img src="img/model/Fig2_Tradition_1080p.gif"> | ||
+ | </div> | ||
+ | <p style="font-size:16px;text-align:center;">Figure 2. Schematic diagram of NucOpt Progs</p> | ||
+ | </div> | ||
+ | <h4 class="content_text_h2" style="font-size:18px;margin-top:20px;"> Roulette Operator</h4> | ||
+ | <div class="content_div_text"> | ||
+ | <p id="p3_8"> The wheel selection method is also known as the fitness proportionate selection (FPS), which means that <span>the fitness of an individual is directly proportional to the probability of being selected</span>, similar to putting all the individuals on a wheel, the area occupied by each individual is directly proportional to its fitness, that is, the probability of being selected is determined by the area of the plate, namely the fittness of each individual when the wheel spins randomly.<br> In other words, the fitness is directly proportional to the probability of being selected, turning the wheel N times to <span>select N </span><span>individuals</span>. Meanwhile, it also means that an individual can be selected repeatedly. | ||
+ | </p> | ||
+ | <div class="content_div_img"> | ||
+ | <img src="img/model/Fig3_GA_1080p.gif"> | ||
+ | </div> | ||
+ | <p style="font-size:16px;text-align:center;">Figure 3. Schematic diagram of genetic algorithm applied to promoter optimizer NuPGO</p> | ||
+ | <p id="p3_9"> For random bases mutations introduced in each round, we set the mutation amount as an <span>adaptive</span> parameter. Each base position in the sequence is a probability mutation rather than the unintelligent setting the amount as "1" from beginning to end. To ensure the rationality and accuracy of the optimization, we set the threshold of <span>mutation probability</span> to 0.0005 so that the number of base mutations in each round would<span> not exceed </span> 5. Of course, other choices are worth expanding. By doing this, the <span>time complexity</span> of the model is reduced to some degree, the curve will <span>converge faster</span> when performing the gradient descends, broadening the search, and <span>more combinations</span> would be explored to approach the global optimal solution. As the iteration converges, the mutation probability will gradually approach 0, that is, reaching the final optimized promoter sequence(Fig.3). | ||
+ | </p> | ||
+ | </div> | ||
+ | </div> | ||
− | + | <div id="content_part"> | |
− | + | <h4 class="content_text_h2" style="font-size:18px;margin-top:20px;">Specific Weights</h4> | |
− | + | <div class="content_div_text"> | |
− | + | <p id="p3_10"align="left"> The critical UAS are usually only about 10 bp in length, and the vital goal we want to achieve, precisely speaking, is to reduce the nucleosome affinity around the <span>UAS region</span> and to make the UAS region tend to become the linker region rather than nucleosome region.<br> The final goal of the algorithm is to decrease the total affinity score of the whole promoter sequence as far as possible, which is <span>likely to ignore</span> the key UAS region mutation and choose other strategies with more significant benefit, the final optimized promoter obtained in such condition may not make use of the advantages of UAS. Specifically, the key UAS is still in the high histone affinity region. To compensate for this shortcoming, we decided to <span>artificially add a regional weight </span>to the score calculation process. According to the formula introduced above, dHMM calculates the nucleosome affinity score for every single site, and then calculates the area under the curve (AUC). <br> To make some adjustments, we decide to compute the score of every single position with a custom zoom when calculating the specific area that we regard special, such as the area of the crucial UAS, in other words, multiplying the simply calculated score by the user-defined weight to get each final single point score.<br> | |
− | + | </p> | |
− | + | <p id="p3_11"align="left"> Fortunately, the results of testing different weights on the NuPGO algorithm are <span>satisfactory</span>. After adding weights to a specific area, NuPGO will pay <span>more attention </span>to that area. In the final optimized result sequence, this specific region will get better performance of nucleosome affinity degradation. | |
− | + | </p> | |
+ | </div> | ||
+ | </div> | ||
− | + | <div id="content_part"> | |
− | + | <h4 class="content_text_h2" style="font-size:18px;margin-top:20px;">Sustainable</h4> | |
− | + | <div class="content_div_text"> | |
+ | <p id="p3_12"align="left"> Ultimately, and most worth noting is that, neither the NuPoP software program which can evaluate DNA’s nucleosome affinity<sup>[2]</sup> nor the iterative optimization code<sup>[3]</sup> that uses the greedy algorithm, because of the <span>components missing </span>and internet limitation, following the tutorials provided in the papers, we couldn’t get the complete executable code file and failed to run the optimizing software with greedy algorithm directly, as well as other researchers. It means the use of NuPoP and greedy algorithm optimizer have been difficult for those who lack the computer or bioinformatics study infrastructure.<br> Therefore, we <span>refactored</span> the implementation code almost completely according to some of its logic, and wrote the genetic algorithm using the roulette wheel operator, which is much easier to read and user-friendly much more than the previous code. In other words, we <span>fix the algorithm published before</span> and come up with a <span>more novel and intelligent</span> software. Both the greedy algorithm optimizer <span>NucOpt Progs</span> and genetic algorithm optimizer <span>NuPGO</span> are published by us for users’ convenient.<br> This means that <span>anyone</span>, especially iGEM teams in the future and researchers in labs, can <span>easily get started</span> using our program based on a tutorial. <span>Input your initial data</span> and conditions visually and easily and it will be on. Those interested are also welcomed to read our source code and make some comments or modifications, and maybe a stronger NuPGO will be come up with next year. | ||
+ | </p> | ||
+ | </div> | ||
+ | </div> | ||
+ | </div> | ||
− | |||
− | + | <hr class="hrmar"style="visibility:hidden;" id="content_part_h2"> | |
− | + | <div id="hrmar"> | |
− | + | <div id="content_part"> | |
− | + | <h2 class="content_text_h2">Result Analysis</h2> | |
+ | <h4 class="content_text_h2" style="font-size:18px;margin-top:20px;"> Optimization result on our promoter</h4> | ||
+ | <div class="content_div_img"> | ||
+ | <img src="img/model/Fig4_PDC1-M8.gif"> | ||
+ | </div> | ||
+ | <p style="font-size:16px;text-align:center;">Figure 4. NuPGO optimization on our project promoter nucleosome affinity</p> | ||
+ | <h4 class="content_text_h2" style="font-size:18px;margin-top:20px;"> Time-consuming comparison</h4> | ||
+ | <div class="content_div_img"> | ||
+ | <img src="img/model/.png"> | ||
+ | </div> | ||
− | + | <h4 class="content_text_h2" style="font-size:18px;margin-top:20px;"> Optimization effect comparison</h4> | |
− | + | <div class="content_div_img"> | |
− | + | <img src="img/model/Fig5_traGA.png"> | |
− | + | </div> | |
− | + | <p style="font-size:16px;text-align:center;">Figure 5. Result comparison between Greedy Algorithm and Genetic Algorithm, weights added result and result without weights.</p> | |
− | + | <div class="content_div_img"> | |
− | + | <img src="img/model/Fig6_Descent_curve.png"> | |
− | + | </div> | |
− | + | <p style="font-size:16px;text-align:center;">Figure 6. Descent curve of total nucleosome affinity score between Greedy Algorithm and Genetic Algorithm, weights added result and result without weights.</p> | |
− | + | ||
− | + | ||
− | + | ||
− | + | <div class="content_div_img"> | |
− | + | <img src="img/engineering_1.png"> | |
− | + | </div> | |
− | + | <p style="font-size:16px;text-align:center;">Figure 7. Result comparison between Greedy Algorithm and Genetic Algorithm, weights added result and result without weights.</p> | |
+ | <div class="content_div_img"> | ||
+ | <img src="img/engineering_1.png"> | ||
+ | </div> | ||
+ | <p style="font-size:16px;text-align:center;">Figure 8. Descent curve of total nucleosome affinity score between Greedy Algorithm and Genetic Algorithm, weights added result and result without weights.</p> | ||
− | |||
− | |||
− | |||
− | |||
− | |||
+ | <h4 class="content_text_h2" style="font-size:18px;margin-top:20px;"> NuPGO result visualization</h4> | ||
+ | <div class="content_div_img"> | ||
+ | <img src="img/model/Fig9_Comparison_all_1080p_english.png"> | ||
+ | </div> | ||
+ | <p style="font-size:16px;text-align:center;">Figure 9. Comparison between initial and optimized project promoter M8 sequence</p> | ||
+ | <div class="content_div_img"> | ||
+ | <img src="img/model/Fig10_Comparison_GCR1_1080p_english.png"> | ||
+ | </div> | ||
+ | <p style="font-size:16px;text-align:center;">Figure 10. Partial protected UAS region during optimization</p> | ||
− | + | <h4 class="content_text_h2" style="font-size:18px;margin-top:20px;">NuPGO result verification</h4> | |
− | + | <div class="content_div_text"> | |
− | + | <p id="p4_1"align="left"> Unfortunately, the NuPGO-optimized promoter has some mutation of around hundreds of bp, requiring chemical synthesis to obtain the new promoter gene fragment. Moreover, it is not statistically representative of testing only the final variant. To prove that the strength of the NuPGO optimized promoter sequence is consistent with the predicted results, we need to select at least 5 to 10 promoter variants, and conduct chemical synthesis and characterization experiments to verify their strength. Due to the lack of project funds and time constraints, we can not perform characterization experiments to ascertain the strength of the sequence obtained by the optimization algorithm. <br> If conditions permit or any other team was interested in verifying the effect of NuPGO, we would like to or recommend selecting the promoter variants with 50, 100, 200, 400, 500bp mutations in the same run, also pick three variants at each of the position, such as 499, 500, 501bp. Then design the parallel characterization experiments to see the strength difference. The descent curve could also be consulted to obtain the selection choice strategy.<br> Nevertheless, we have reason to believe that the results of the optimization are relatively reliable, because, in our references(Fig.6), the authors conducted experiments on the strength characterization of the promoter variants obtained by the greedy algorithm. We can conclude from the experimental results that nucleosome affinity optimization indeed can increase the promoter's strength and thus enhance expression. | |
− | + | </p> | |
− | + | <div class="content_div_img"> | |
− | + | <img src="img/model/Fig11.png"> | |
− | + | </div> | |
+ | <p style="font-size:16px;text-align:center;">Figure 10. Redesign of native yeast promoters for increased expression by decreasing nucleosome affinity in reference article. <span>DOI: 10.1038/ncomms5002</span></p> | ||
+ | </div> | ||
+ | </div> | ||
+ | </div> | ||
+ | <hr class="hrmar"style="visibility:hidden;" id="content_part_h2"> | ||
+ | <div id="hrmar"> | ||
+ | <div id="content_part"> | ||
+ | <h2 class="content_text_h2">Availability</h2> | ||
+ | <div class="content_div_text"> | ||
+ | <p id="5_1"> We publish our source code and packages on Github. We extraordinarily sincerely and strongly recommend that every future iGEM team or researcher using a eukaryotic promoter use our software to achieve a higher expressing level and rational use of UAS strong promoters. </p> | ||
+ | <a href="https://2021.igem.org/Team:SCUT-China">Github</a> | ||
+ | </div> | ||
+ | </div> | ||
+ | </div> | ||
+ | <hr class="hrmar" id="content_part_h2" style="visibility:hidden;"> | ||
+ | <div id="hrmar"> | ||
+ | <div id="content_part"> | ||
+ | <h2 class="content_text_h2">References</h2> | ||
+ | <div class="content_div_text"> | ||
+ | <p>[1]L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," in Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, Feb. 1989, doi: 10.1109/5.18626. </p> | ||
+ | <p>[2]Xi, L., Fondufe-Mittendorf, Y., Xia, L. et al. Predicting nucleosome positioning using a duration Hidden Markov Model. BMC Bioinformatics <span>11,</span> 346 (2010). https://doi.org/10.1186/1471-2105-11-346 </p> | ||
+ | <p>[3] [3]Curran, K., Crook, N., Karim, A. et al. Design of synthetic yeast promoters via tuning of nucleosome architecture. Nat Commun <span>5,</span> 4002 (2014). https://doi.org/10.1038/ncomms5002 </p> | ||
+ | <p>[4]de Boer, C.G., Vaishnav, E.D., Sadeh, R. et al. Deciphering eukaryotic gene-regulatory logic with 100 million random promoters. Nat Biotechnol <span>38,</span> 56–65 (2020). https://doi.org/10.1038/s41587-019-0315-8. </p> | ||
+ | </div> | ||
+ | </div> | ||
+ | </div> | ||
+ | </div> | ||
+ | </div> | ||
+ | </div> | ||
+ | </main> | ||
+ | <!-- -0000000000000000000000000000000000000底栏000000000000000000000000000000000000000 --> | ||
+ | <div id="bottom_div"> | ||
+ | <div id="bottom_div_img"> | ||
+ | <!-- <img class="" src="img/bottom.png" alt=""> --> | ||
+ | <div class="a_div"> | ||
+ | <a href="https://2021.igem.org/Team:SCUT-China"><div class="a_div"></div></a> | ||
+ | </div> | ||
+ | </div> | ||
− | + | </body> | |
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | <!-- 加载进度条 --> | |
− | + | <!-- <script type="text/javascript"> | |
+ | window.onload=function(){ | ||
+ | var loading='<div class="loading"><div class="pic"></div></div>'; | ||
+ | var varbody = document.getElementById("body"); | ||
+ | var varloading = document.getElementById("loading"); | ||
+ | varbody.append(loading); | ||
+ | setInterval(function(){ | ||
+ | varloading.fadeOut(); | ||
+ | },3000); | ||
+ | } | ||
+ | </script> --> | ||
− | + | <!-- 左侧导航栏固定与滑动 --> | |
− | + | <!-- 左侧导航栏固定与滑动 --> | |
+ | <script type="text/javascript"> | ||
+ | var nav = document.getElementById("nav"); | ||
+ | var card = document.getElementById("card-holder"); | ||
+ | var divBox = document.getElementById("divBox_0"); | ||
+ | var navH =nav.offsetTop-80; | ||
+ | window.addEventListener('scroll',function(e){ | ||
− | + | var scroH = document.body.scrollTop; | |
− | + | if(scroH>=navH){ | |
− | + | nav.style.position = 'fixed' | |
− | + | nav.style.top = '50px' | |
− | + | nav.style.width='23%' | |
− | + | card.style.visibility = 'visible' | |
− | + | divBox.style.visibility = 'visible' | |
− | + | }else{ | |
− | + | // alert( scroH-navH); | |
− | + | nav.style.position = 'static' | |
+ | nav.style.width='98%' | ||
+ | card.style.visibility = 'hidden'//可以设置它隐藏 | ||
+ | divBox.style.visibility = 'hidden' | ||
+ | } | ||
+ | }) | ||
+ | </script> | ||
+ | <!-- 左侧导航栏交互效果- --> | ||
+ | <script> | ||
+ | window.onload = function () { | ||
+ | var allDiv = document.querySelectorAll("#wrap #content_part_h2"); | ||
+ | var allNavLi = document.querySelectorAll("#nav li"); | ||
+ | var Div = document.getElementById("divBox"); | ||
+ | /** | ||
+ | * 滚动一定距离时,返回顶部按钮出现 滚动页面改变导航li的弹出 | ||
+ | */ | ||
+ | function toTopIcon() { | ||
+ | window.onscroll = function () { | ||
+ | var d = document.documentElement.scrollTop || document.body.scrollTop;//获取滚动条高度 | ||
+ | var viewd = document.documentElement.clientHeight;//获取可视区高度 | ||
− | + | for (var i = 0; i < allDiv.length; i++) { | |
− | + | allDiv[i].index = i; | |
− | + | if (d >= allDiv[i].offsetTop && d < allDiv[i].offsetTop + 500) { | |
− | + | for (var j = 0; j < allDiv.length; j++) { | |
− | + | allNavLi[j].className = ""; | |
− | + | allNavLi[i].className = "on"; | |
− | + | } | |
− | + | } | |
− | + | } | |
+ | <!----> | ||
+ | var d = document.documentElement.scrollTop || document.body.scrollTop;//获取滚动条高度 | ||
+ | var h = document.body.scrollHeight;//全文高 | ||
+ | var view = document.documentElement.clientHeight;//获取可视区高度 | ||
+ | if(d>view){ | ||
+ | Div.style.height = (d-view)/h*1250 + "px"; | ||
+ | Div.style.visibility = "visible"; | ||
+ | }else { | ||
+ | Div.style.visibility = "hidden"; | ||
+ | } | ||
− | + | } | |
− | + | } | |
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | /** | |
− | + | * 点击导航li li弹出 页面跳转到相应div位置 | |
− | + | * scrollTo默认的是瞬间滚动到坐标位置,把behavior属性设置为smooth就可以支持平滑滚动了,不过缺点是无法设置滚动速率 | |
− | + | */ | |
− | + | function navLiPop() { | |
− | + | for (var i = 0; i < allNavLi.length; i++){ | |
− | + | allNavLi[i].index = i; | |
− | + | allNavLi[i].onclick = function () { | |
− | + | for (var j = 0; j < allNavLi.length; j++){ | |
− | + | window.scrollTo({ | |
− | + | top:allDiv[this.index].offsetTop-50, | |
+ | behavior:"smooth" | ||
+ | }); | ||
+ | } | ||
+ | } | ||
+ | } | ||
+ | } | ||
− | + | /** | |
− | + | * 点击按钮回到顶部(定时器平滑滚动) | |
− | + | */ | |
− | + | function navToTop() { | |
− | + | } | |
− | + | navLiPop(); | |
− | + | toTopIcon(); | |
− | + | navToTop(); | |
− | + | } | |
− | + | </script> | |
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | ||
− | + | <!-- 111111111111111111111111111111111111111111111切换语言 --> | |
− | + | <script type="text/javascript"> | |
− | + | var count=0; | |
− | + | // var oP=document.getElementsByClassName("content_bglt"); | |
− | + | function changeLanguage(){ | |
− | + | // alert("点击事件发生"); | |
− | + | count++; | |
− | + | if(count%2==1){ | |
− | + | document.getElementById("p1_1").innerHTML=" 染色体是由DNA与核小体缠绕紧密结合形成超螺旋而形成的,转录因子是一种特殊的蛋白质,会结合到基因的特定区域使下游开始转录,因此成为转录因子。而在染色体状态下的基因与核小体紧密地结合与盘旋,无法暴露出能够让核小体特异性结合的序列(transcriptional factor binding sites,TFBS),which对于启动子而言就是上游激活序列(upstream activating sequence,UAS)。只有在核小体松弛的时候或结合不紧密的区段,TF才能够更顺利地结合上UAS并发挥作用。"; | |
− | + | document.getElementById("p1_2").innerHTML=" 核小体对于基因的亲和力,是由基因的碱基排列组合决定的,因此碱基序列稍有不同,其对核小体的亲和力也会有不同。因此我们希望通过改变UAS以外的碱基序列,让启动子对核小体的亲和力降低。为此我们亟需一个工具 能够计算序列的核小体亲和力,作为评估函数,为引入的碱基微调的效果打分。"; | |
− | + | document.getElementById("p1_3").innerHTML=" 假设拥有这样能够计算核小体亲和力的打分工具,理论上我们就能计算所有碱基排列的组合的核小体亲和力,选出亲和力最低的序列理论上就代表着强度最高的启动子。当然这样的思路拥有高时间复杂度和低特异性,我们选择结合遗传算法对启动子序列进行迭代优化。"; | |
− | + | document.getElementById("p2_1").innerHTML=" 隐马尔科夫模型已经被世人研究经过了几十余年,鉴于DNA序列本身的特性,以及许多积累的证据都能够表明DNA序列本身就能高度预测核小体在体内的定位,而且染色质的功能也与核小体定位有密切关系,因此科学家提出了用dHMM模型预测染色体DNA对应的核小体占有情况:分为nucleosome(N)状态和linker(L)状态,N状态的长度固定为147bp,而L状态的长度是可变的。假定两个状态一定是交替相连的,并且一个完整的染色质序列以linker状态起始和结束。</br>"+"设 e = e<sub>1</sub>,... ,e<sub>147</sub>是一段核小体 DNA 序列。设 p<sub>n</sub> 是观察 e 属于核小体区域的概率,计算为4阶马尔可夫链模型下两个沃森与克里克的概率的乘积。</br>我们假设一个给定物种的linker DNA 长度可以遵循一个未知的分布,将f<sub> l </sub>(k)定义为 k = 1,... ,τ<sub>l</sub> (我们允许的最大linker度)。一个被观测的linker DNA 序列 e = e<sub>1</sub>,... ,e <sub>k</sub> 携带两段信息,长度是 k bp,给它定义的字母是 e<sub>1</sub>,... ,e <sub>k</sub>。设 G<sub>L</sub> (e | k)表示连接 DNA 的齐次马尔可夫链模型(同样包括两条链) 。然后观察 e 作为linker的可能性。"; | |
− | + | document.getElementById("p2_2").innerHTML=" 设 x = x<sub>1</sub>,... ,x<sub>n</sub> 是一个长度为 n 的 DNA 序列,其中 x<sub>i </sub>= A/C/G/T 设 z = z<sub>1</sub>,... ,z<sub>n</sub> 为对应的隐状态路径,其中如果 x<sub>i </sub>被一个nucleosome状态覆盖,z<sub> i</sub> = 1,否则为0。假设路径 z = z<sub>1</sub>,... ,z<sub>n</sub> 将 x 划分为 k 个连续的nucleosome或连接linker区,其中nucleosome区域的长度一致为147 bp,而linker的长度可能不同。我们把这些区块表示为 y = y + ,... ,y <sub>b</sub>,它们的状态标记为 s = s<sub>1</sub>,... ,s <sub>b</sub>,其中如果 y<sub> i</sub> 是nucleosome状态,s <sub>i</sub> = 1,否则为0。观测概率(x,z)计算方法为"; | |
− | + | document.getElementById("p2_3").innerHTML=" 其中, π<sub>0</sub>(s<sub>1</sub>)和 π<sub>e </sub>(s <sub>b</sub>) 分别表示DNA链的初始和结束状态为 s<sub>1</sub>和 s<sub>k</sub> 的概率,I是指示函数。由于我们假设一个完整的染色质序列必须以linker状态开始和结束,π<sub>0</sub>(s<sub>1 </sub>= 0) = π<sub>e</sub> (s<sub>b</sub> = 0) = 1。我们定义了在一个特定位置 i 上的核小体占有,即 z<sub> i</sub> = 1的后验概率,"; | |
− | + | document.getElementById("p2_4").innerHTML=" 位于 i 位置的组蛋白结合亲和力评分为区域 x<sub> i-73</sub>,... x <sub>i</sub>,... ,x<sub> i + 73</sub>属于nucleosome区域和属于linker区域各自概率的对数似然比,即,"; | |
− | + | document.getElementById("p2_5").innerHTML=" 给定模型 P<sub>N</sub>,G<sub>L</sub> 和 F<sub>L</sub>,最优路径 z 可以通过标准的维特比算法算法找到,核小体占用得分可以通过前向和后向算法估计。"; | |
− | + | document.getElementById("p2_6").innerHTML=" 该模型已经过根据大量经过焦磷酸测序的酵母核小体DNA片段数据以及按不同物种划分的数据进行训练,测试结果表现良好,并制作成了一个名为NuPoP的软件工具,作为R包供研究人员使用。"; | |
− | < | + | |
− | + | ||
− | + | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | + | document.getElementById("p3_1").innerHTML=" 我们设计在不破坏UAS序列和启动子核心区域的条件下,引入随机碱基突变,获得启动子序列的变体with不同的核小体亲和力。"; | |
− | + | document.getElementById("p3_2").innerHTML=" 之前已经有一个类似的迭代优化模型被提出,使用的方法在学术上称作“贪心算法”,具体步骤而言,是对初始启动子序列引入穷尽枚举的所有可能的随机单点突变,并利用NuPoP计算所有变体的核小体亲和力,将亲和力最低的一条序列作为优胜者筛出,作为母本进入到下一轮的随机突变枚举,依次循环迭代下去,最终得到分数难以再产生变化的理论最优序列。"; | |
− | + | document.getElementById("p3_3").innerHTML=" 但是这样的方法存在诸多弊端,首先1穷尽枚举法带来了非常大的时间复杂度,优化一条长度为800bp的序列需要耗时36个小时,其次2只采用单点突变以及贪心算法,非常容易使优化进入到局部最优解,假设某一碱基的突变是大幅度降低亲和力的一组突变中的其中一个碱基突变,但是单观察这个碱基突变所带来的收益是非常小的,这就容易导致这一更优情况被忽略,另外3已发表的研究指出,单纯的低亲和力并不与启动子的强度有着直接的联系,研究者推测,核小体亲和力的位置分布可能起到影响启动子强度的关键作用。"; | |
− | + | document.getElementById("p3_4").innerHTML=" 鉴于以上3点缺陷,我们设计出了能够弥补这些缺点,达到更好优化效果的NuPGO。我们的新策略是,将NuPoP与遗传算法with轮盘算子结合,突变的碱基数能够根据概率自收敛而不是每次只能引入单点随机突变。"; | |
+ | document.getElementById("p3_5").innerHTML=" 贪婪算法和轮盘赌选择法,本质上都属于遗传算法,分别是其中的两种遗传算子(GA Operator),排序选择和轮盘选择(rank-based selection and roulette wheel selection)<br> 贪婪算法和轮盘赌遗传算法本质上都是进化算法。贪婪算法也在部分情况下表现得比其他进化算法更好。但是贪婪算法的优势其实在这里没有体现出来,简单枚举减少计算量但是增加了时间复杂度,而且得到的效果并不太好。遗传算法中的轮盘算子能够考虑多种进化方向,这一优点能够在启动子优化案例中很好地体现出来。"; | ||
+ | document.getElementById("p3_6").innerHTML=" 先前的算法的致命缺点就在于只选取了每轮中的表现最优者,作者设计时是为了减少穷尽枚举法的时间复杂度,以及追求最优的得分。但这样的算法非常容易让结果陷入局部最优解,例如,ABC三点突变比DEF三点突变所降低的亲和力要更多,但B位置的碱基单点突变所下降的核小体亲和力比D位置的碱基单点突变所下降的核小体亲和力少,贪婪算法会选择D位置的碱基突变策略。"; | ||
+ | document.getElementById("p3_7").innerHTML=" 而且在每轮中只引入随机单点突变,设置每轮的突变数为1,导致枚举的数量依旧非常大,时间复杂度没有得到足够的减轻。贪婪算法的执行非常耗费时间,在我们的测试中,一条长度为800bp的启动子就需要等待36小时的运行时间。"; | ||
+ | document.getElementById("p3_8").innerHTML=" 轮盘选择法也成为适应比例选择法(fitness proportionate selection,FPS),指的是个体的适应度直接根据正比关系得到被选择的概率,类似于将所有个体放在一个轮盘上,每个个体所占的面积与适应度成正比关系,即当轮盘转起来随机落在一个区域上时,被选中的概率由板块的面积所决定,即适应度与被选中的概率成正比关系,转动n次轮盘,选出n个个体,也意味着一个个体可以被重复选择。"; | ||
+ | document.getElementById("p3_9").innerHTML=" 对于每轮引入的随机碱基突变,我们将碱基突变数设置为自适应参数,使序列的每个碱基位置都是依概率突变,而不是从头到尾都是设定的1。为了保证优化的合理性和精确性,我们把突变概率的阈值设为0.0005,以让每轮中的碱基突变数不超过5。这样的设置使模型的时间复杂度大大削减,梯度下降时曲线会更快地收敛,还能够探索更多的组合可能,更逼近全局最优解。随着迭代的收敛,突变概率也会慢慢趋近于0,即达到最终的优化完成的序列。"; | ||
+ | document.getElementById("p3_10").innerHTML=" 起核心作用的UAS通常只有20bp左右的基序长度,我们最关键想要实现的,准确地说,应该是降低UAS所在区域的核小体亲和力,让UAS所在片段倾向于成为linker区而不是nucleosome区。</br> 算法的目标是尽可能降低整条启动子序列的核小体亲和力总分,有可能会忽略突变核心UAS周边而选择收益更大的突变,最终优化后得到的启动子可能并没有凸显UAS的优势。为了弥补这个缺点,我们决定人为地加入评分时的区域权重。根据上述公式我们可以得知,持续时间隐马尔可夫模型dHMM在计算核小体亲和力时是计算每一个单独碱基位置的核小体亲和力得分,再求取其AUC。</br> 因此我们可以在计算单点得分时让我们想要指定的特定区域,例如核心UAS所在的区域,获得一个自定义的缩放,分数与权重相乘后得到最终的单点位置得分。</br>"; | ||
+ | document.getElementById("p3_11").innerHTML=" 在NuPGO的算法上测试添加不同权重的结果是令人满意的,在给特定区域添加权重后,NuPGO会更重视这一区域,在最终的优化结果序列中这一特定区域会得到更好的核小体亲和力下降表现。"; | ||
+ | document.getElementById("p3_11").innerHTML=" 最终值得一提的是,无论是用于评估DNA的核小体亲和力的NuPoP软件程序,还是使用贪婪算法的迭代优化代码,由于网络限制以及贪婪算法部分内容缺失,根据论文提供的途径已经无法获取完整的可执行的代码文件,无法直接运行贪婪算法的优化程序,NuPoP的使用对于部分缺少计算机或生物信息学基础的人而言还是比较吃力的。/<br> 因此我们几乎完全重构了用于实现的代码,并写成了利用轮盘赌算子的遗传算法,并且比之前的代码更加易于阅读和user-friendly。无论是贪婪算法NucOpt Progs还是遗传算法NuPGO,我们都一并上传了完整的可执行代码集。</br> 这意味着没有生信基础的人、将来的iGEM团队或者是实验室里的研究人员,都可以根据教程轻易地上手开始使用我们的程序,直观且便捷地输入自己的初始数据和条件。"; | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | + | document.getElementById("p3_11").innerHTML=" 非常可惜的是,经过NuPGO优化获得的启动子通常已经发生了几百bp的突变,需要通过化学合成方法得到新启动子基因片段。而且,只对最终的一条变体进行实验验证,在统计学上而言是不具有代表意义的。想要证明NuPGO所更新得到的启动子序列强度提升与预测的结果相吻合,需要挑选最少5~10条启动子变体序列,进行化学合成和表征实验验证其强度。由于项目经费的缺乏以及时间限制,我们无法对通过优化算法得到的序列进行表征实验验证其强度。</br> 如果条件允许,或者有其他的队伍对我们的模型感兴趣并想要验证NuPGO的效果,我们会采取或推荐在一次优化过程中,选取突变了50, 100, 200, 400, 500bp的变体进行表征实验测试启动子的强度,并且在500bp左右要多选取几条变体,如突变了499, 500, 501bp的3条启动子都挑选出来做表征实验,这样才有统计学意义。另外也可以根据软件生成的下降曲线,设定固定步长,如下降了50, 100, 150, … 500, 550核小体亲和力的变体序列,挑选出来进行表征实验。</br> 但我们有理由相信优化结果是相对可靠的,这是因为在我们的参考文献(图6)中作者进行了贪婪算法得到的启动子变体强度表征实验。我们可以从实验结果中得出结论,核小体亲和力优化确实能够提高启动子的强度从而增强表达。"; | |
− | + | ||
− | + | document.getElementById("p3_11").innerHTML=" 我们把开源代码和软件包都放在了Github,由衷地并强烈地推荐每一个使用真核启动子的未来iGEM队伍或研究人员使用我们的傻瓜式软件获得更高表达、更合理利用UAS的强启动子。"; | |
− | + | // ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | |
+ | }else{ | ||
+ | document.getElementById("p1_1").innerHTML=" Most eukaryotic genomic DNA is wrapped in nucleosomes, which occlude and <span>strongly distort</span> the wrapped DNA. Transcription factor is a special protein that binds to a specific region of the gene to start transcription downstream. However, in the chromosomal state, the genomic DNA is tightly bound to the nucleosome. It can not reveal the <span>transcribed factor binding sites</span> (TFBS), the upstream activating sequence (UAS) for the promoter. Only when the nucleosome is <span>relaxed or not tightly bound</span> could TF have more opportunity to bind UAS and function effectively."; | ||
+ | document.getElementById("p1_2").innerHTML=" The affinity of nucleosomes to genomic DNA is determined by the <span>arrangement and combination</span> of sequence bases, so the affinity will be different if the sequence is slightly altered. Therefore, we hope to decrease the histone binding affinity score by introducing mutations to the sequence bases except for UAS and the core region. For this reason, a tool that can calculate the nucleosome affinity as an <span>evaluation function</span> to score the effect of the introduction of base fine-tuning is <span>in urgent need.</span>"; | ||
+ | document.getElementById("p1_3").innerHTML=" Suppose there is a scoring tool that could calculate histone binding affinity theoretically. In that case, we could calculate nucleosome affinity for all the possible arrangements of bases in a <span>fixed length</span>. The promoter sequence in theory with the lowest nucleosome affinity sequence and highest strength could be <span>selected</span>. Of course, this approach has extremely high time complexity and low specificity; that's the reason we choose to combine genetic algorithm to optimize the promoter sequence iteratively."; | ||
+ | document.getElementById("p2_1").innerHTML=" The hidden Markov model (HMM) has been known for decades<sup>[1]</sup>.<br />"+"Given the characteristic of DNA sequence, and the accumulated pieces of evidence that DNA sequence itself can highly predict the localization of nucleosome in vivo, also the function of chromatin is closely related to the localization of nucleosome, scientists proposed using a <span>duration Hidden Markov Model</span> (dHMM) to predict the occupation of the corresponding nucleosomes in chromosomal DNA: divided into the <span>nucleosome (N) state </span>and<span> the linker (L) state</span>, where the length of the nucleosome state is fixed at 147 bp and the linker state is variable<sup>[2]</sup>. It is assumed that the two states must be alternately connected. A complete chromatin sequence must begin and end in the linker state, training a 4th order time-dependent Markov chain for the N state and a homogeneous 4th order Markov chain for the L state. <br />"+"In the detailed method, they let <span>e</span> = e<sub>1</sub>, ..., e<sub>147</sub> be corresponding to the nucleosome DNA sequence, and let P<sub>N</sub> be the probability of observing <span>e</span> as a nucleosome, computed as the product of probabilities for both Watson and Crick strands under the 4th order Markov Chain model. They assume that the linker DNA length of a given species has an unknown distribution F<sub>L</sub> (k) defined for k = 1, ..., τ<sub>L</sub> (the maximum linker length allowed). An observed linker DNA sequence <span>e</span> = e<sub>1</sub>, ..., e<sub>k</sub> carries two pieces of information, the length is k bp, and given which, the emitted letters are e<sub>1</sub>, ..., e<sub>k</sub> . Let G<sub>L</sub> (<span>e</span>|k) denote the homogeneous Markov chain model for the linker DNA (again including both strands). Then observing <span>e</span> as a linker DNA has probability."; | ||
− | + | document.getElementById("p2_2").innerHTML=" Suppose <span>x</span> = x<sub>1</sub>, ..., x<sub>n</sub> is a genomic DNA sequence of length n, where x<sub>i</sub> = A/C/G/T. Let <span>z</span> = z<sub>1</sub>, ..., z<sub>n</sub> be the corresponding hidden state path, where z<sub>i</sub> = 1 if x<sub>i</sub> is covered by a nucleosome state, and 0 otherwise. Suppose that the path <span>z</span> = z<sub>1</sub>, ..., z<sub>n</sub> partitions <span>x</span> into k consecutive nucleosome or linker state blocks, in which the nucleosome blocks have a uniform length of 147 bp, whereas the length of linker blocks may vary. They denote these blocks as <span>y</span> = <span>y+</span>, ..., <span>y<sub>B</span></sub> , and their state identification as <span>s</span> = s<sub>1</sub>, ..., s<sub>B</sub> , where s<sub>i</sub> = 1 if <span>y<sub>i</span></sub> is nucleosome state, and 0 otherwise. The probability of observing (<span>x</span>, <span>z</span>) is given by"; | |
− | + | document.getElementById("p2_3").innerHTML=" where π<sub>0</sub>(s<sub>1</sub>) and π<sub>e</sub>(s<sub>B</sub> ) stand for the probabilities that the chain initializes and ends with state s<sub>1</sub> and s<sub>k</sub> respectively, and I is an indicator function. Since they assume that a complete chromatin sequence must start with and end in a linker state, π<sub>0</sub>(s<sub>1</sub> = 0) = π<sub>e</sub> (s<sub>B</sub> = 0) = 1. Define the nucleosome occupancy at a specific position i, denoted o i , as the posterior probability that z<sub>i</sub> = 1, i.e.,"; | |
− | + | document.getElementById("p2_4").innerHTML=" Most importantly, the histone binding affinity score at position i is defined as the log likelihood ratio for the region x<sub>i-73</sub>, ... x<sub>i</sub> , ..., x<sub>i+73</sub> to be a nucleosome vs. a linker, i.e.,"; | |
+ | document.getElementById("p2_5").innerHTML=" Given the models P<sub>N</sub>, G<sub>L</sub> and F<sub>L</sub>, the optimal path <span>z</span> can be found by the standard Viterbi algorithm, and the nucleosome occupancy score can be estimated using forward and backward algorithms<sup>[2]</sup>."; | ||
+ | document.getElementById("p2_6").innerHTML=" The model <span>has been trained</span> on a large number of yeast nucleosome DNA fragments dataset that have been pyrosequencing and on a species by species basis, and the test results are satisfactory<sup>[2]</sup>. They made the <span>nucleosome occupancy predicting tool</span> into a software tool called <span>NuPoP</span>, which means 'nucleosome position predicting', as an R language package for researchers to use."; | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | + | document.getElementById("p3_1").innerHTML=" We designed to obtain the variants with different nucleosome affinity of the promoter sequence by introducing the limited amount of random bases <span>mutation</span> without destroying the UAS sequence and the core region of the promoter."; | |
− | + | document.getElementById("p3_2").innerHTML=" Actually, a <span>similar</span> iterative optimization model has been proposed, called <span>NucOpt Progs</span>. The theory is technically known as a “<span>Greedy algorithm</span>”, in which all possible random single mutations are <span>enumerated</span> on the initial promoter sequence, and every single sequence variant was calculated the nucleosome affinity by NuPoP to obtain a score. After that, the sequence with the lowest affinity was selected as the winner in this round, then taken as the parent placing into the next round, calculating all mutation possible cases enumeration on this winner and vote in a new champion variant candidate. Finally, the theoretical best optimal sequence which is difficult to gain higher score is obtained<sup>[3]</sup>."; | |
− | + | document.getElementById("p3_3").innerHTML=" However, after learning and testing NucOpt Progs, we found that this method has many <span>disadvantages</span>. First, the exhaustive enumeration method brings<span> immense time complexity</span>. It takes 36 hours to optimize a sequence of 800 bp length (based on the computer capacity). Secondly, single mutation only and greedy algorithm in use obviously determine that it's extraordinarily easy to get optimization fallen into a <span>local optimal solution</span>. For example, assuming that a single base change is a component of a group of mutations that significantly decrease histone affinity, but rarely calculating the benefit exactly this single base change could bring is unremarkable. In this case, the superior choice would easily be neglected. Thirdly, taking into account the biological effect, other published studies have shown that rarely low affinity is <span>not directly related</span> to the promoter's strength<sup>[3]</sup>. At the same time, the researchers speculate, the site distribution of nucleosome affinity may play a vital role in the promoter strength."; | |
− | + | document.getElementById("p3_4").innerHTML=" Given the above three defects, we have designed a software <span>NuPGO</span> which can make up for these shortcomings and achieve a <span>better optimization effect</span>. Our new strategy is to combine NuPoP with the genetic algorithm(Fig.1) using the <span>roulette wheel operator</span> so that the base number of mutations can realize <span>adaptive</span> convergence according to probability instead of only introducing a single random mutation at a time, which not only effectively reduce the time complexity but also take much more possibilities into account, escaping local optimal solution. Considering the biological effect, we additionally set the <span>user-defined region weights</span> especially, which enable the algorithm to place larger or smaller weights on the specific sequence region."; | |
− | + | document.getElementById("p3_5").innerHTML=" Greedy algorithm and roulette wheel selection algorithm are both essentially <span>evolutionary algorithms</span>. The greedy algorithm also performs better than other evolutionary algorithms in some cases. Still, its efficiency advantage brought by enumeration couldn't display here, while the <span>multi-factor considering</span> ability in wheel selection algorithm stands out."; | |
− | + | document.getElementById("p3_6").innerHTML=" The fatal disadvantage of the previous promoter optimizing algorithm is that only the<span> best performer</span> is selected in each round. Perhaps the author aimed to reduce the time complexity of exhaustive enumeration and simply pursue the best score. But such algorithms can easily trap the results into <span>local optima</span>. For example, the three-point mutation on site ABC decrease the histone affinity more than the three-point mutation on site DEF. However, the affinity decrease of site B mutation is smaller than that of site D mutation, so the greedy algorithm will choose site D mutation strategy as the winner, while site B strategy is obviously better actually."; | |
+ | document.getElementById("p3_7").innerHTML=" Moreover, only a random single mutation is introduced in each round. In other words, the mutation number is set to 1, which leads to gigantic amounts of <span>enumerations</span>, and the time complexity has not been sufficiently reduced. The execution of the greedy algorithm is time-consuming. In our test, an 800bp length promoter had to cost 36 hours to run. "; | ||
+ | document.getElementById("p3_8").innerHTML=" The wheel selection method is also known as the fitness proportionate selection (FPS), which means that <span>the fitness of an individual is directly proportional to the probability of being selected</span>, similar to putting all the individuals on a wheel, the area occupied by each individual is directly proportional to its fitness, that is, the probability of being selected is determined by the area of the plate, namely the fittness of each individual when the wheel spins randomly.<br> In other words, the fitness is directly proportional to the probability of being selected, turning the wheel N times to <span>select N </span><span>individuals</span>. Meanwhile, it also means that an individual can be selected repeatedly."; | ||
+ | document.getElementById("p3_9").innerHTML=" For random bases mutations introduced in each round, we set the mutation amount as an <span>adaptive</span> parameter. Each base position in the sequence is a probability mutation rather than the unintelligent setting the amount as '1' from beginning to end. To ensure the rationality and accuracy of the optimization, we set the threshold of <span>mutation probability</span> to 0.0005 so that the number of base mutations in each round would<span> not exceed</span> 5. Of course, other choices are worth expanding. By doing this, the <span>time complexity</span> of the model is reduced to some degree, the curve will <span>converge faster</span> when performing the gradient descends, broadening the search, and <span>more combinations</span> would be explored to approach the global optimal solution. As the iteration converges, the mutation probability will gradually approach 0, that is, reaching the final optimized promoter sequence(Fig.3)."; | ||
+ | document.getElementById("p3_10").innerHTML=" The critical UAS are usually only about 10 bp in length, and the vital goal we want to achieve, precisely speaking, is to reduce the nucleosome affinity around the <span>UAS region</span> and to make the UAS region tend to become the linker region rather than nucleosome region.</br> The final goal of the algorithm is to decrease the total affinity score of the whole promoter sequence as far as possible, which is <span>likely to ignore</span> the key UAS region mutation and choose other strategies with more significant benefit, the final optimized promoter obtained in such condition may not make use of the advantages of UAS. Specifically, the key UAS is still in the high histone affinity region. To compensate for this shortcoming, we decided to <span>artificially add a regional weight </span>to the score calculation process. According to the formula introduced above, dHMM calculates the nucleosome affinity score for every single site, and then calculates the area under the curve (AUC). </br> To make some adjustments, we decide to compute the score of every single position with a custom zoom when calculating the specific area that we regard special, such as the area of the crucial UAS, in other words, multiplying the simply calculated score by the user-defined weight to get each final single point score.<br>"; | ||
+ | document.getElementById("p3_11").innerHTML=" Fortunately, the results of testing different weights on the NuPGO algorithm are <span>satisfactory</span>. After adding weights to a specific area, NuPGO will pay <span>more attention </span>to that area. In the final optimized result sequence, this specific region will get better performance of nucleosome affinity degradation."; | ||
+ | document.getElementById("p3_12").innerHTML=" Ultimately, and most worth noting is that, neither the NuPoP software program which can evaluate DNA’s nucleosome affinity<sup>[2]</sup> nor the iterative optimization code<sup>[3]</sup> that uses the greedy algorithm, because of the <span>components missing </span>and internet limitation, following the tutorials provided in the papers, we couldn’t get the complete executable code file and failed to run the optimizing software with greedy algorithm directly, as well as other researchers. It means the use of NuPoP and greedy algorithm optimizer have been difficult for those who lack the computer or bioinformatics study infrastructure.</br> Therefore, we <span>refactored</span> the implementation code almost completely according to some of its logic, and wrote the genetic algorithm using the roulette wheel operator, which is much easier to read and user-friendly much more than the previous code. In other words, we <span>fix the algorithm published before</span> and come up with a <span>more novel and intelligent</span> software. Both the greedy algorithm optimizer <span>NucOpt Progs</span> and genetic algorithm optimizer <span>NuPGO</span> are published by us for users’ convenient.</br> This means that <span>anyone</span>, especially iGEM teams in the future and researchers in labs, can <span>easily get started</span> using our program based on a tutorial. <span>Input your initial data</span> and conditions visually and easily and it will be on. Those interested are also welcomed to read our source code and make some comments or modifications, and maybe a stronger NuPGO will be come up with next year."; | ||
− | + | document.getElementById("p4_1").innerHTML=" Unfortunately, the NuPGO-optimized promoter has some mutation of around hundreds of bp, requiring chemical synthesis to obtain the new promoter gene fragment. Moreover, it is not statistically representative of testing only the final variant. To prove that the strength of the NuPGO optimized promoter sequence is consistent with the predicted results, we need to select at least 5 to 10 promoter variants, and conduct chemical synthesis and characterization experiments to verify their strength. Due to the lack of project funds and time constraints, we can not perform characterization experiments to ascertain the strength of the sequence obtained by the optimization algorithm. </br> If conditions permit or any other team was interested in verifying the effect of NuPGO, we would like to or recommend selecting the promoter variants with 50, 100, 200, 400, 500bp mutations in the same run, also pick three variants at each of the position, such as 499, 500, 501bp. Then design the parallel characterization experiments to see the strength difference. The descent curve could also be consulted to obtain the selection choice strategy.</br> Nevertheless, we have reason to believe that the results of the optimization are relatively reliable, because, in our references(Fig.6), the authors conducted experiments on the strength characterization of the promoter variants obtained by the greedy algorithm. We can conclude from the experimental results that nucleosome affinity optimization indeed can increase the promoter's strength and thus enhance expression."; | |
− | + | ||
− | + | ||
− | + | document.getElementById("p2_1").innerHTML=" We publish our source code and packages on Github. We extraordinarily sincerely and strongly recommend that every future iGEM team or researcher using a eukaryotic promoter use our software to achieve a higher expressing level and rational use of UAS strong promoters."; | |
+ | } | ||
− | + | } | |
+ | var btn3=document.getElementById("cl"); | ||
+ | btn3.onclick=changeLanguage; | ||
+ | </script> | ||
+ | <!-- 避免JS运行时页面还没有加载出来,所以需要放在body后 --> | ||
+ | <script type="text/javascript"> | ||
+ | var oB=document.getElementById("bg"); | ||
+ | var oL=document.getElementById("lt"); | ||
+ | var oP=document.getElementsByClassName("content_div_text"); | ||
+ | var sum=15; | ||
+ | function bigger(){ | ||
+ | sum+=2; | ||
+ | var i; | ||
+ | for (i = 0; i < oP.length; i++) { | ||
+ | oP[i].style.fontSize=sum+"px";//在JS中不能出现“-”符号 | ||
+ | } | ||
+ | } | ||
+ | oB.οnclick=bigger; | ||
− | + | function small(){ | |
− | + | sum-=2; | |
− | + | for (i = 0; i < oP.length; i++) { | |
− | + | oP[i].style.fontSize=sum+"px";;//在JS中不能出现“-”符号 | |
− | + | // alert(oP.length); | |
− | + | } | |
− | + | ||
− | + | ||
− | + | ||
− | + | } | |
− | + | oL.οnclick=small; | |
− | + | </script> | |
− | + | <!-- 444444444444444444444444444444444444444444444444444444444444444444444以上需要替换444444444444444444444444444444444444444444444444444444444444444 --> | |
− | + | <!-- 444444444444444444444444444444444444444444444444444444444444444444444以上需要替换444444444444444444444444444444444444444444444444444444444444444 --> | |
− | + | <!-- -0000000000000000000000000000000000000加载动图代码000000000000000000000000000000000000000 --> | |
− | + | <script type="text/javascript" src="https://2021.igem.org/Template:SCUT-China/Myjs-on-555?action=raw&ctype=text/javascript"></script> | |
− | + | <!-- -0000000000000000000000000000000000000加载动图代码000000000000000000000000000000000000000 --> | |
− | + | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
</html> | </html> |
Revision as of 21:11, 21 October 2021
Loading...
Background
Most eukaryotic genomic DNA is wrapped in nucleosomes, which occlude and strongly distort the wrapped DNA. Transcription factor is a special protein that binds to a specific region of the gene to start transcription downstream. However, in the chromosomal state, the genomic DNA is tightly bound to the nucleosome. It can not reveal the transcribed factor binding sites (TFBS), the upstream activating sequence (UAS) for the promoter. Only when the nucleosome is relaxed or not tightly bound could TF have more opportunity to bind UAS and function effectively.
The affinity of nucleosomes to genomic DNA is determined by the arrangement and combination of sequence bases, so the affinity will be different if the sequence is slightly altered. Therefore, we hope to decrease the histone binding affinity score by introducing mutations to the sequence bases except for UAS and the core region. For this reason, a tool that can calculate the nucleosome affinity as an evaluation function to score the effect of the introduction of base fine-tuning is in urgent need.
Suppose there is a scoring tool that could calculate histone binding affinity theoretically. In that case, we could calculate nucleosome affinity for all the possible arrangements of bases in a fixed length. The promoter sequence in theory with the lowest nucleosome affinity sequence and highest strength could be selected. Of course, this approach has extremely high time complexity and low specificity; that's the reason we choose to combine genetic algorithm to optimize the promoter sequence iteratively.
Fundamental
The hidden Markov model (HMM) has been known for decades[1].
Given the characteristic of DNA sequence, and the accumulated pieces of evidence that DNA sequence itself can highly predict the localization of nucleosome in vivo, also the function of chromatin is closely related to the localization of nucleosome, scientists proposed using a duration Hidden Markov Model (dHMM) to predict the occupation of the corresponding nucleosomes in chromosomal DNA: divided into the nucleosome (N) state and the linker (L) state, where the length of the nucleosome state is fixed at 147 bp and the linker state is variable[2]. It is assumed that the two states must be alternately connected. A complete chromatin sequence must begin and end in the linker state, training a 4th order time-dependent Markov chain for the N state and a homogeneous 4th order Markov chain for the L state.
In the detailed method, they let e = e1, ..., e147 be corresponding to the nucleosome DNA sequence, and let PN be the probability of observing e as a nucleosome, computed as the product of probabilities for both Watson and Crick strands under the 4th order Markov Chain model. They assume that the linker DNA length of a given species has an unknown distribution FL (k) defined for k = 1, ..., τL (the maximum linker length allowed). An observed linker DNA sequence e = e1, ..., ek carries two pieces of information, the length is k bp, and given which, the emitted letters are e1, ..., ek . Let GL (e|k) denote the homogeneous Markov chain model for the linker DNA (again including both strands). Then observing e as a linker DNA has probability
Suppose x = x1, ..., xn is a genomic DNA sequence of length n, where xi = A/C/G/T. Let z = z1, ..., zn be the corresponding hidden state path, where zi = 1 if xi is covered by a nucleosome state, and 0 otherwise. Suppose that the path z = z1, ..., zn partitions x into k consecutive nucleosome or linker state blocks, in which the nucleosome blocks have a uniform length of 147 bp, whereas the length of linker blocks may vary. They denote these blocks as y = y+, ..., yB , and their state identification as s = s1, ..., sB , where si = 1 if yi is nucleosome state, and 0 otherwise. The probability of observing (x, z) is given by
where π0(s1) and πe(sB ) stand for the probabilities that the chain initializes and ends with state s1 and sk respectively, and I is an indicator function. Since they assume that a complete chromatin sequence must start with and end in a linker state, π0(s1 = 0) = πe (sB = 0) = 1. Define the nucleosome occupancy at a specific position i, denoted o i , as the posterior probability that zi = 1, i.e.,
Most importantly, the histone binding affinity score at position i is defined as the log likelihood ratio for the region xi-73, ... xi , ..., xi+73 to be a nucleosome vs. a linker, i.e.,
Given the models PN, GL and FL, the optimal path z can be found by the standard Viterbi algorithm, and the nucleosome occupancy score can be estimated using forward and backward algorithms[2].
The model has been trained on a large number of yeast nucleosome DNA fragments dataset that have been pyrosequencing and on a species by species basis, and the test results are satisfactory[2]. They made the nucleosome occupancy predicting tool into a software tool called NuPoP, which means "nucleosome position predicting", as an R language package for researchers to use.
Strategy
Optimizing design
We designed to obtain the variants with different nucleosome affinity of the promoter sequence by introducing the limited amount of random bases mutation without destroying the UAS sequence and the core region of the promoter.
Actually, a similar iterative optimization model has been proposed, called NucOpt Progs. The theory is technically known as a “Greedy algorithm”, in which all possible random single mutations are enumerated on the initial promoter sequence, and every single sequence variant was calculated the nucleosome affinity by NuPoP to obtain a score. After that, the sequence with the lowest affinity was selected as the winner in this round, then taken as the parent placing into the next round, calculating all mutation possible cases enumeration on this winner and vote in a new champion variant candidate. Finally, the theoretical best optimal sequence which is difficult to gain higher score is obtained[3].
However, after learning and testing NucOpt Progs, we found that this method has many disadvantages. First, the exhaustive enumeration method brings immense time complexity. It takes 36 hours to optimize a sequence of 800 bp length (based on the computer capacity). Secondly, single mutation only and greedy algorithm in use obviously determine that it's extraordinarily easy to get optimization fallen into a local optimal solution. For example, assuming that a single base change is a component of a group of mutations that significantly decrease histone affinity, but rarely calculating the benefit exactly this single base change could bring is unremarkable. In this case, the superior choice would easily be neglected. Thirdly, taking into account the biological effect, other published studies have shown that rarely low affinity is not directly related to the promoter's strength[3]. At the same time, the researchers speculate, the site distribution of nucleosome affinity may play a vital role in the promoter strength.
Given the above three defects, we have designed a software NuPGO which can make up for these shortcomings and achieve a better optimization effect. Our new strategy is to combine NuPoP with the genetic algorithm(Fig.1) using the roulette wheel operator so that the base number of mutations can realize adaptive convergence according to probability instead of only introducing a single random mutation at a time, which not only effectively reduce the time complexity but also take much more possibilities into account, escaping local optimal solution. Considering the biological effect, we additionally set the user-defined region weights especially, which enable the algorithm to place larger or smaller weights on the specific sequence region.
Figure 1. Genetic algorithm schematic diagram
Greedy Algorithm vs Roulette Operator
Greedy algorithm and roulette wheel selection algorithm are both essentially evolutionary algorithms. The greedy algorithm also performs better than other evolutionary algorithms in some cases. Still, its efficiency advantage brought by enumeration couldn't display here, while the multi-factor considering ability in wheel selection algorithm stands out.
Greedy Algorithm
The fatal disadvantage of the previous promoter optimizing algorithm is that only the best performer is selected in each round. Perhaps the author aimed to reduce the time complexity of exhaustive enumeration and simply pursue the best score. But such algorithms can easily trap the results into local optima. For example, the three-point mutation on site ABC decrease the histone affinity more than the three-point mutation on site DEF. However, the affinity decrease of site B mutation is smaller than that of site D mutation, so the greedy algorithm will choose site D mutation strategy as the winner, while site B strategy is obviously better actually.
Moreover, only a random single mutation is introduced in each round. In other words, the mutation number is set to 1, which leads to gigantic amounts of enumerations, and the time complexity has not been sufficiently reduced. The execution of the greedy algorithm is time-consuming. In our test, an 800bp length promoter had to cost 36 hours to run.
Figure 2. Schematic diagram of NucOpt Progs
Roulette Operator
The wheel selection method is also known as the fitness proportionate selection (FPS), which means that the fitness of an individual is directly proportional to the probability of being selected, similar to putting all the individuals on a wheel, the area occupied by each individual is directly proportional to its fitness, that is, the probability of being selected is determined by the area of the plate, namely the fittness of each individual when the wheel spins randomly.
In other words, the fitness is directly proportional to the probability of being selected, turning the wheel N times to select N individuals. Meanwhile, it also means that an individual can be selected repeatedly.
Figure 3. Schematic diagram of genetic algorithm applied to promoter optimizer NuPGO
For random bases mutations introduced in each round, we set the mutation amount as an adaptive parameter. Each base position in the sequence is a probability mutation rather than the unintelligent setting the amount as "1" from beginning to end. To ensure the rationality and accuracy of the optimization, we set the threshold of mutation probability to 0.0005 so that the number of base mutations in each round would not exceed 5. Of course, other choices are worth expanding. By doing this, the time complexity of the model is reduced to some degree, the curve will converge faster when performing the gradient descends, broadening the search, and more combinations would be explored to approach the global optimal solution. As the iteration converges, the mutation probability will gradually approach 0, that is, reaching the final optimized promoter sequence(Fig.3).
Specific Weights
The critical UAS are usually only about 10 bp in length, and the vital goal we want to achieve, precisely speaking, is to reduce the nucleosome affinity around the UAS region and to make the UAS region tend to become the linker region rather than nucleosome region.
The final goal of the algorithm is to decrease the total affinity score of the whole promoter sequence as far as possible, which is likely to ignore the key UAS region mutation and choose other strategies with more significant benefit, the final optimized promoter obtained in such condition may not make use of the advantages of UAS. Specifically, the key UAS is still in the high histone affinity region. To compensate for this shortcoming, we decided to artificially add a regional weight to the score calculation process. According to the formula introduced above, dHMM calculates the nucleosome affinity score for every single site, and then calculates the area under the curve (AUC).
To make some adjustments, we decide to compute the score of every single position with a custom zoom when calculating the specific area that we regard special, such as the area of the crucial UAS, in other words, multiplying the simply calculated score by the user-defined weight to get each final single point score.
Fortunately, the results of testing different weights on the NuPGO algorithm are satisfactory. After adding weights to a specific area, NuPGO will pay more attention to that area. In the final optimized result sequence, this specific region will get better performance of nucleosome affinity degradation.
Sustainable
Ultimately, and most worth noting is that, neither the NuPoP software program which can evaluate DNA’s nucleosome affinity[2] nor the iterative optimization code[3] that uses the greedy algorithm, because of the components missing and internet limitation, following the tutorials provided in the papers, we couldn’t get the complete executable code file and failed to run the optimizing software with greedy algorithm directly, as well as other researchers. It means the use of NuPoP and greedy algorithm optimizer have been difficult for those who lack the computer or bioinformatics study infrastructure.
Therefore, we refactored the implementation code almost completely according to some of its logic, and wrote the genetic algorithm using the roulette wheel operator, which is much easier to read and user-friendly much more than the previous code. In other words, we fix the algorithm published before and come up with a more novel and intelligent software. Both the greedy algorithm optimizer NucOpt Progs and genetic algorithm optimizer NuPGO are published by us for users’ convenient.
This means that anyone, especially iGEM teams in the future and researchers in labs, can easily get started using our program based on a tutorial. Input your initial data and conditions visually and easily and it will be on. Those interested are also welcomed to read our source code and make some comments or modifications, and maybe a stronger NuPGO will be come up with next year.
Result Analysis
Optimization result on our promoter
Figure 4. NuPGO optimization on our project promoter nucleosome affinity
Time-consuming comparison
Optimization effect comparison
Figure 5. Result comparison between Greedy Algorithm and Genetic Algorithm, weights added result and result without weights.
Figure 6. Descent curve of total nucleosome affinity score between Greedy Algorithm and Genetic Algorithm, weights added result and result without weights.
Figure 7. Result comparison between Greedy Algorithm and Genetic Algorithm, weights added result and result without weights.
Figure 8. Descent curve of total nucleosome affinity score between Greedy Algorithm and Genetic Algorithm, weights added result and result without weights.
NuPGO result visualization
Figure 9. Comparison between initial and optimized project promoter M8 sequence
Figure 10. Partial protected UAS region during optimization
NuPGO result verification
Unfortunately, the NuPGO-optimized promoter has some mutation of around hundreds of bp, requiring chemical synthesis to obtain the new promoter gene fragment. Moreover, it is not statistically representative of testing only the final variant. To prove that the strength of the NuPGO optimized promoter sequence is consistent with the predicted results, we need to select at least 5 to 10 promoter variants, and conduct chemical synthesis and characterization experiments to verify their strength. Due to the lack of project funds and time constraints, we can not perform characterization experiments to ascertain the strength of the sequence obtained by the optimization algorithm.
If conditions permit or any other team was interested in verifying the effect of NuPGO, we would like to or recommend selecting the promoter variants with 50, 100, 200, 400, 500bp mutations in the same run, also pick three variants at each of the position, such as 499, 500, 501bp. Then design the parallel characterization experiments to see the strength difference. The descent curve could also be consulted to obtain the selection choice strategy.
Nevertheless, we have reason to believe that the results of the optimization are relatively reliable, because, in our references(Fig.6), the authors conducted experiments on the strength characterization of the promoter variants obtained by the greedy algorithm. We can conclude from the experimental results that nucleosome affinity optimization indeed can increase the promoter's strength and thus enhance expression.
Figure 10. Redesign of native yeast promoters for increased expression by decreasing nucleosome affinity in reference article. DOI: 10.1038/ncomms5002
Availability
We publish our source code and packages on Github. We extraordinarily sincerely and strongly recommend that every future iGEM team or researcher using a eukaryotic promoter use our software to achieve a higher expressing level and rational use of UAS strong promoters.
GithubReferences
[1]L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," in Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, Feb. 1989, doi: 10.1109/5.18626.
[2]Xi, L., Fondufe-Mittendorf, Y., Xia, L. et al. Predicting nucleosome positioning using a duration Hidden Markov Model. BMC Bioinformatics 11, 346 (2010). https://doi.org/10.1186/1471-2105-11-346
[3] [3]Curran, K., Crook, N., Karim, A. et al. Design of synthetic yeast promoters via tuning of nucleosome architecture. Nat Commun 5, 4002 (2014). https://doi.org/10.1038/ncomms5002
[4]de Boer, C.G., Vaishnav, E.D., Sadeh, R. et al. Deciphering eukaryotic gene-regulatory logic with 100 million random promoters. Nat Biotechnol 38, 56–65 (2020). https://doi.org/10.1038/s41587-019-0315-8.