Metabolism and personalization
Basic equations
For caffeine intake, we introduce a model similar to the drug-discovery model in ARK.micro (UCAS-China, 2019). As shown below, ingested caffeine first travels to the digestive system, where some of it is absorbed into the circulatory system. Although caffeine degradation occurs in the liver and follows Michaelis-Menten kinetics, we can approximate the degradation as taking place in the circulatory system when the caffeine concentration is low enough. Finally, the caffeine travels to the nervous system and influences the body.
Additionally, some further simplifications can be made. Previous work has shown that the amount of absorbed caffeine is proportional to the intake, so we can treat the circulatory system as receiving the intake directly, as shown below.
Approximately, the kinetics of these concentrations can be described by the ODEs below: \begin{align} \dfrac{\mathrm{d}C_c}{\mathrm{d}t} & =-(k_{d1}+k_t)C_c+k_af(t);\\ \dfrac{\mathrm{d}C_n}{\mathrm{d}t} & =k_tC_c-k_{d2}C_n, \end{align} where $C_c$ and $C_n$ are the caffeine concentrations in the circulatory and nervous systems, respectively, and $f(t)$ is the intake function. As the capacity of the digestive system is limited, the intake function can be estimated as a square wave, that is, \begin{equation} f(t)=\left\{ \begin{array}{ll} v_0, & 0< t< t_0; \\ 0, & t> t_0. \end{array} \right. \end{equation}
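As a sanity check, the two-compartment model with the square-wave intake can be integrated with a simple forward-Euler scheme. The parameter values below are purely illustrative placeholders, not fitted values:

```python
import numpy as np

def simulate(k_a, k_t, k_d1, k_d2, v0, t0, t_end=12.0, dt=0.01):
    """Forward-Euler integration of the two-compartment caffeine model.

    C_c: caffeine concentration in the circulatory system
    C_n: caffeine concentration in the nervous system
    f(t): square-wave intake (v0 for 0 < t < t0, else 0)
    """
    n = int(t_end / dt)
    Cc = np.zeros(n + 1)
    Cn = np.zeros(n + 1)
    for i in range(n):
        t = i * dt
        f = v0 if 0 < t < t0 else 0.0
        Cc[i + 1] = Cc[i] + dt * (-(k_d1 + k_t) * Cc[i] + k_a * f)
        Cn[i + 1] = Cn[i] + dt * (k_t * Cc[i] - k_d2 * Cn[i])
    return Cc, Cn

# Illustrative parameter values (not fitted to any user):
Cc, Cn = simulate(k_a=1.0, k_t=0.5, k_d1=0.2, k_d2=0.3, v0=2.0, t0=0.5)
```

As expected, $C_c$ rises while the intake lasts and then decays, while $C_n$ lags behind it, reflecting the transport delay from the circulatory to the nervous system.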
Ideal prediction
As the parameters listed above vary across the population and are hard to measure directly, our model needs to estimate their values from the user's previous data. If we could measure $C_n$ continuously, the parameters could be estimated by minimizing \begin{equation} \mu(k_a,k_t,k_{d1},k_{d2})=\int_{t_{\text{curr}}-T_0}^{t_{\text{curr}}}w(t_{\text{curr}}-\tau)[C_n^{\text{model}}(\tau)-C_n^{\text{measure}}(\tau)]^2\,\mathrm{d}\tau, \end{equation} where $t_{\text{curr}}$ is the current time, $T_0$ is the update period, $C_n^{\text{model}}$ is the value of $C_n$ calculated by the model, and $C_n^{\text{measure}}$ is the measured value of $C_n$. The function $\mu$ can be minimized by gradient descent. In particular, since the relevance of a measurement decays over time, a time-dependent weight is introduced; a possible form of $w$ is $w(t)=e^{-\kappa t}$. Then, after each period, the parameters can be refreshed by \begin{equation} k_i^{\text{curr}}=w(T_0)k_i^{\text{prev}}+k_i^{\text{optimized}}. \end{equation}
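The weighted fitting loss and its minimization by gradient descent can be sketched as follows. The one-parameter decay model at the end is only a hypothetical stand-in for $C_n^{\text{model}}$, used to show that the fit recovers a known rate from synthetic data:

```python
import numpy as np

def weighted_loss(params, t, Cn_meas, model, kappa=0.1):
    """Discretized version of mu: squared model-vs-measurement error
    weighted by w(t_curr - tau) = exp(-kappa (t_curr - tau)), so recent
    samples count more than old ones."""
    w = np.exp(-kappa * (t[-1] - t))
    dt = t[1] - t[0]
    return np.sum(w * (model(params, t) - Cn_meas) ** 2) * dt

def grad_descent(loss, params, lr=0.05, steps=500, eps=1e-5):
    """Plain gradient descent with central-difference gradients."""
    p = np.array(params, dtype=float)
    for _ in range(steps):
        g = np.zeros_like(p)
        for i in range(p.size):
            d = np.zeros_like(p); d[i] = eps
            g[i] = (loss(p + d) - loss(p - d)) / (2 * eps)
        p -= lr * g
    return p

# Toy check with a one-parameter decay model (stand-in for C_n^model):
t = np.linspace(0.0, 10.0, 200)
model = lambda p, tt: np.exp(-p[0] * tt)
Cn_meas = model([0.3], t)                 # synthetic "measurement", true rate 0.3
k_fit = grad_descent(lambda p: weighted_loss(p, t, Cn_meas, model), [0.5])
```

In a real deployment the same loss would be evaluated against the full two-compartment model over all four rate constants.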
However, it is hard to measure $C_n$ continuously, so we need to determine how frequently the hardware should sample $C_n$. In our software design, we record the time of each caffeine intake with an accuracy of 1 minute. That is, a coffee-intake event is recorded as
<start_time>, <end_time>, <total_amount>.
Therefore, if we assume that coffee is drunk at a constant speed, the variables above can be discretized: \begin{align} C_c[t]&=a_{1,1}C_c[t-1]+a_{1,3}g[t-1],\\ C_n[t]&=a_{2,1}C_c[t-1]+a_{2,2}C_n[t-1],\\ g[t]&=\dfrac{\langle\text{total\_amount}\rangle}{\langle\text{end\_time}\rangle-\langle\text{start\_time}\rangle},\quad t\in[\langle\text{start\_time}\rangle,\langle\text{end\_time}\rangle). \end{align}
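The discretized recurrences above can be run minute by minute from a logged intake record. The coefficient values here are placeholders, since fitting them per user is exactly the task discussed next:

```python
def simulate_discrete(record, a11, a13, a21, a22, n_steps):
    """Minute-by-minute update of the discretized equations; the a_ij are
    the discrete-time counterparts of the rate constants and must be fitted.

    `record` is one intake event: (start_time, end_time, total_amount),
    with times in minutes, matching the software's logging format.
    """
    start, end, total = record
    g = total / (end - start)              # constant drinking-speed assumption
    Cc = [0.0]; Cn = [0.0]
    for t in range(1, n_steps):
        intake = g if start <= t - 1 < end else 0.0
        Cc.append(a11 * Cc[t - 1] + a13 * intake)
        Cn.append(a21 * Cc[t - 1] + a22 * Cn[t - 1])
    return Cc, Cn

# Illustrative coefficients only (to be fitted per user):
Cc, Cn = simulate_discrete((10, 20, 200.0),
                           a11=0.98, a13=0.05, a21=0.01, a22=0.99, n_steps=240)
```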
After constructing models, we need to determine the coefficients $a_{i,j}$ above.
Feedback from users
However, in real situations, we may only have access to sleep data. For example, some portable devices can classify sleep status from heart rate, movement, and so on. Usually there are four states: active, rapid eye movement (REM), light, and deep (based on Garmin Connect; the states may vary across devices). Therefore, for user feedback, when we only take sleep into account, the output value at each time step is one of these states. In general, to fit a user's historical data in a personalized way and to advise the user on coffee intake, a neural-network-based classifier is introduced here. The framework is shown below.
In order to fit users' routines, we use the difference between the current week and past weeks (denoted as the baseline above) instead of using data from the current week only. The baseline is refreshed every week by the rule baseline := w*data_last_week + (1-w)*baseline. The intake data are fed in twice. The blue path receives only an array of discrete values, the sign function$^{(*)}$ of the input, to capture the placebo effect. The red path implements Eqs.(6)-(7) with the $C_c[t-1]$ and $C_n[t-1]$ terms eliminated, respectively. This lets us capture the cumulative effect of coffee intake, which is why we use convolutional neural networks.$^{(**)}$ After that, the sleep-history, placebo-effect and caffeine-effect features are concatenated and the probabilities of the sleep states are computed. Finally, the result is multiplied by the baseline frequencies and a Softmax function generates the final probability.
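The weekly baseline refresh is an exponential moving average over past weeks; a minimal sketch (the weight 0.3 and the four-state frequency vectors are illustrative only):

```python
import numpy as np

def refresh_baseline(baseline, data_last_week, w=0.3):
    """Weekly update: baseline := w * data_last_week + (1 - w) * baseline.
    `w` controls how quickly the baseline adapts to routine changes."""
    return w * np.asarray(data_last_week) + (1 - w) * np.asarray(baseline)

# Hypothetical per-state sleep frequencies (active, REM, light, deep):
baseline = np.array([0.1, 0.6, 0.2, 0.1])
last_week = np.array([0.2, 0.5, 0.2, 0.1])
baseline = refresh_baseline(baseline, last_week)
```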
(*) The derivative of the sign function is unsuitable for gradient-based training, so it is usually replaced by a smooth surrogate such as the sigmoid. Recently, there have been new approaches to such problems; see refs. [1] and [2] for details.
(**) A recurrent neural network would also work, but it is more time-consuming than a convolutional neural network. Moreover, we can set the convolutional kernel to $h[t]=[a\quad aq\quad aq^2\quad\cdots\quad aq^{\tau}]$, where $a$ and $q$ are to be trained; in theory, this can reproduce the effect of a recurrent neural network.
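The geometric kernel from footnote (**) can be demonstrated directly: with $|q|<1$, a causal convolution of the intake series with $h$ yields an exponentially fading memory of past intake, which is the same behavior a simple recurrent unit would accumulate. Parameter values here are illustrative, not trained:

```python
import numpy as np

def geometric_kernel(a, q, tau):
    """Kernel h = [a, a*q, a*q^2, ..., a*q^tau]."""
    return a * q ** np.arange(tau + 1)

def caffeine_effect(intake, a=1.0, q=0.9, tau=5):
    """Causal convolution: effect[t] = sum_k h[k] * intake[t - k]."""
    h = geometric_kernel(a, q, tau)
    return np.convolve(intake, h)[: len(intake)]

# A single unit of intake at minute 1 leaves a geometrically decaying trace:
effect = caffeine_effect(np.array([0.0, 1.0, 0, 0, 0, 0, 0, 0]))
```

Because the kernel has only two trainable scalars, it is far cheaper to learn than a full recurrent cell while encoding the same decaying-memory prior.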
Perspectives
Recall that in section 2 we introduced a feedback control system to regulate the flow rate from the input and target caffeine concentrations, and above we introduced a neural-network-based method to train the model on a particular user's historical data. Since the models can be expressed as transition matrices, once we specify a probability distribution of sleep states for some time period, we can invert these matrices (keeping the baseline data fixed) to obtain estimates of the caffeine intake together with confidence intervals. Therefore, a user who wants to get rid of caffeine can plan the intake with the algorithm above: set the future sleep data similar to historical data (that is, keep a relatively stable lifestyle) and set the quantity of caffeine intake to the lower bound of the given confidence interval, or interact with our loophole complex (see section 2) directly, so that such data are treated as $c_{out}$ of each coffee-intake event automatically.
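Because the discretized caffeine model is linear, the "inverse of the matrices" idea can be sketched concretely: stack the impulse response of $C_n$ into a lower-triangular (Toeplitz) matrix and solve for the intake sequence by least squares. All coefficient values are illustrative:

```python
import numpy as np

def plan_intake(Cn_target, a11, a13, a21, a22):
    """Recover an intake sequence g that would produce a desired C_n trace,
    by inverting the linear discrete model with least squares."""
    n = len(Cn_target)
    # Impulse response of C_n to a unit intake g[0] = 1:
    Cc = np.zeros(n); Cn = np.zeros(n)
    for t in range(1, n):
        Cc[t] = a11 * Cc[t - 1] + a13 * (1.0 if t == 1 else 0.0)
        Cn[t] = a21 * Cc[t - 1] + a22 * Cn[t - 1]
    # Toeplitz system: Cn_target = A @ g with A[t, k] = Cn_impulse[t - k].
    A = np.zeros((n, n))
    for k in range(n):
        A[k:, k] = Cn[: n - k]
    g, *_ = np.linalg.lstsq(A, np.asarray(Cn_target, dtype=float), rcond=None)
    return np.clip(g, 0.0, None)   # intake cannot be negative
```

A round trip (simulate a trace from a known intake, then invert it) recovers the original intake, which is what lets the planner propose feasible schedules for a target sleep outcome.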
Another limitation of our model is that predictions are rough when a user has only used the product for a few weeks. Therefore, another goal is to upload data from all users who are willing to share it to the server. Once the database is constructed, we can group users by cluster analysis and compute the mean parameter values of the model within each cluster. When a new user joins, he/she is initially classified by fundamental data such as height, weight, gender, and aspects of lifestyle. At first, the parameters are the mean values of his/her cluster, but as he/she keeps using the product, they are gradually replaced by values trained on his/her own data.
Protein thermal stability optimization
In order to preserve the flavor of coffee as much as possible, we hope to raise the reaction temperature of the system to at least 70$^{\circ}$C, but the optimal reaction temperature of wild-type CkTcS is only 40$^{\circ}$C. We therefore want to increase the thermal stability of CkTcS without affecting its catalytic efficiency.
A common method to improve protein thermal stability is directed evolution, which uses error-prone PCR and other experimental techniques to generate a large number of mutants that are then screened. However, most of the mutants are near-neutral, and many of the rest are harmful, so obtaining a thermally stable and catalytically active protein takes a great deal of time, and we may not get the desired result at all. Fortunately, Goldenzweig et al. (2016) developed an algorithm called PROSS to improve protein stability. Using PROSS, we designed the desired protein in a short time.
The PROSS algorithm is based on phylogeny and energy estimation. First, by analyzing and comparing a large number of homologous sequences (i.e., multiple sequence alignment, MSA), we can find the conserved sites and co-evolving sites in the protein sequence. According to evolutionary theory, the most conserved amino acids tend to stabilize the protein or play an important role in its function, while amino acids that rarely or never appear may damage protein stability or biological function. There may be disulfide bonds, hydrogen bonds and other interactions between co-evolving sites, and these interactions may help maintain the protein conformation. Therefore, this step allows us to screen out harmful mutations and delineate the set of allowed mutations.
Next, using the Rosetta energy function, we calculate the impact of each allowed mutation on the protein energy. If the energy reduction exceeds a threshold, the mutation is deemed potentially stabilizing; the threshold is set to reduce 'false positive' mutations. Since a single mutation has little effect on protein stability, we then use Rosetta combinatorial sequence design to analyze the combined effect of multiple mutations obtained in the previous step, producing 9 representative designs.
We obtained 9 designs from the PROSS platform, some containing few mutations ($<5\%$) and some a large number ($>20\%$). Zhang et al. (2020) identified amino acids important for the enzyme's function; we removed the mutations at these sites and back-translated the amino acid sequence to a DNA sequence. Our experimental group conducted follow-up experiments to verify the activity and thermal stability.
References
[1] Shenglong Zhou et al. Quadratic convergence of smoothing Newton's method for 0/1 loss optimization.
[2] Shenglong Zhou et al. Computing one-bit compressive sensing via double-sparsity constrained optimization.
[3] Goldenzweig A et al. Automated Structure- and Sequence-Based Design of Proteins for High Bacterial Expression and Stability. Mol Cell. 2016 Jul 21;63(2):337-346.