Robust enzyme discovery and engineering with deep learning using CataPro



Overview of CataPro

During the dataset preparation phase, we first created the initial kcat and Km datasets by collecting all kcat and Km entries, respectively, from the BRENDA32 and SABIO-RK33 databases. Details of the kcat and Km data collection and processing can be found in Supplementary Figs. 1–2, and the analysis of processed kcat and Km data is shown in Supplementary Fig. 3. Samples containing both experimental kcat and Km values were extracted to construct the kcat/Km dataset. To establish a fair benchmark for evaluating the accuracy and generalization capability of enzyme kinetic parameter prediction models, we applied CD-HIT59-based sequence clustering (sequence similarity cutoff of 0.4) within each dataset to form ten enzyme groups. Enzyme sequences and canonical SMILES of substrates were collected from UniProt60 and PubChem61, respectively. Thus, the unbiased ten-fold cross-validation datasets for kcat, Km, and kcat/Km were created (Fig. 1a).
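As a minimal sketch of how cluster-level splitting keeps homologous enzymes in the same fold, the following Python snippet assigns whole clusters to folds. The function name `assign_folds`, the toy cluster labels, and the greedy size-balancing rule are illustrative assumptions; the text above specifies only that CD-HIT clusters at a 0.4 cutoff were used to form ten groups.

```python
from collections import defaultdict

def assign_folds(cluster_of, n_folds=10):
    """Map each sample id to a fold so that no cluster spans two folds.

    cluster_of: dict of sample_id -> cluster_id (e.g. parsed from clustering output).
    Assumed balancing rule: largest clusters first, each placed in the
    currently smallest fold.
    """
    members = defaultdict(list)
    for sample, cluster in cluster_of.items():
        members[cluster].append(sample)

    fold_sizes = [0] * n_folds
    fold_of = {}
    # Sort by descending cluster size; break ties by cluster id for determinism.
    for cluster in sorted(members, key=lambda c: (-len(members[c]), str(c))):
        target = min(range(n_folds), key=lambda f: fold_sizes[f])
        for sample in members[cluster]:
            fold_of[sample] = target
        fold_sizes[target] += len(members[cluster])
    return fold_of

# Toy example with invented sample and cluster names.
toy = {"s1": "c1", "s2": "c1", "s3": "c2", "s4": "c3", "s5": "c3", "s6": "c4"}
folds = assign_folds(toy, n_folds=3)
print(folds)
```

Because folds are filled cluster by cluster, any two sequences that share a cluster always land in the same fold, which is the property the unbiased benchmark relies on.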

Fig. 1: Overview of CataPro.
a Establishment of the ten-fold unbiased datasets for kcat, Km, and kcat/Km. The enzyme sequence similarity between any two folds is lower than 0.4. b Kinetic parameter (kcat or Km) prediction framework based on language model embeddings and molecular fingerprints. c kcat/Km prediction framework based on kcat and Km pre-trained models.

The CataPro model is a neural network-based framework for predicting enzyme kinetic parameters, using amino acid sequences and SMILES as the original representations for enzymes and substrates, respectively. Enzyme information is encoded into a vector having a length of 1024 using the ProtT5-XL-UniRef50 model (referred to as ProtT5)57, similar to UniKP. For substrates, we employed MolT5 embeddings62 and MACCS keys fingerprints jointly as feature representations, having dimensions of 768 and 167, respectively. The enzyme and substrate representations are concatenated into a 1959-dimensional vector, serving as the overall information for the enzyme-catalyzed reaction. This concatenated vector is used as the input to the neural network for predicting the kcat or Km values.
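The dimension bookkeeping described above can be sketched as follows. This is an illustration only: the helper name `build_pair_feature`, the concatenation order, and the zero-filled placeholder vectors are assumptions, and the real embeddings would come from ProtT5, MolT5, and a MACCS keys calculator.

```python
# Dimensions quoted in the text above.
PROTT5_DIM, MOLT5_DIM, MACCS_DIM = 1024, 768, 167

def build_pair_feature(enzyme_vec, molt5_vec, maccs_bits):
    """Concatenate enzyme and substrate features (the order is an assumption)."""
    parts = (enzyme_vec, molt5_vec, maccs_bits)
    expected = (PROTT5_DIM, MOLT5_DIM, MACCS_DIM)
    for part, dim in zip(parts, expected):
        if len(part) != dim:
            raise ValueError(f"expected length {dim}, got {len(part)}")
    return list(enzyme_vec) + list(molt5_vec) + list(maccs_bits)

# Placeholder vectors stand in for real ProtT5, MolT5, and MACCS outputs.
feature = build_pair_feature([0.0] * 1024, [0.0] * 768, [0] * 167)
print(len(feature))  # 1024 + 768 + 167 = 1959
```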

For predicting kcat/Km, some researchers have employed the same architecture used for predicting kcat and Km43. However, this approach may not achieve optimal performance, as it makes kcat/Km entirely independent of kcat and Km. To mitigate this concern, we proposed a strategy that incorporates a neural network-based correction term to enhance the accuracy of kcat/Km predictions. First, the kcat and Km models were pre-trained using the unbiased kcat/Km dataset with ten-fold cross-validation. Then, the initial kcat/Km predictions were derived by dividing the predictions from the pre-trained kcat models by those from the pre-trained Km models. Due to the inherent errors in the kcat and Km models, the initial kcat/Km predictions may exhibit substantial inaccuracies. Therefore, a correction model was trained to mitigate this error. The input to the correction term prediction network remains a 1959-dimensional vector representing the enzyme-substrate feature, which is transformed into a 256-dimensional vector after passing through a dense layer. Then, a feature-wise attention layer before the output layer extracts crucial information. The predicted correction, combined with the ratio of kcat to Km, yields the final predicted kcat/Km value. During the training phase of the correction factor model, the parameters of the pre-trained kcat and Km models are held fixed.
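The way a correction term can be combined with the kcat-to-Km ratio is sketched below. The text does not state the exact combination rule, so this sketch assumes that all quantities are handled on a log10 scale, where the ratio becomes a difference, and that the correction is additive. The function name `combine_kcat_km` and the numbers are hypothetical.

```python
def combine_kcat_km(log10_kcat, log10_km, correction):
    """Assumed rule: log10(kcat/Km) = log10(kcat) - log10(Km) + correction.

    log10_kcat and log10_km stand for outputs of the frozen pre-trained models;
    correction stands for the output of the separately trained correction network.
    """
    initial = log10_kcat - log10_km  # ratio of the two predictions in log space
    return initial + correction

# Invented values: the correction shifts the initial estimate of 3.0 down to 2.5.
final = combine_kcat_km(log10_kcat=2.0, log10_km=-1.0, correction=-0.5)
print(final)
```

Under this assumption, only the correction network receives gradient updates, which matches the statement that the pre-trained kcat and Km models are held fixed.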

Performance of CataPro in kcat and Km predictions

In this work, unless otherwise stated, all training and testing were conducted on our unbiased ten-fold cross-validation datasets. Evaluation metrics include Pearson’s correlation coefficient (PCC), Root-Mean-Square-Error (RMSE), and Spearman’s correlation coefficient (SCC). The inclusion of SCC is particularly important as ranking power is crucial in enzyme mining and enzyme engineering. We retrained DLKcat and UniKP on our datasets under the same testing conditions to ensure fair comparison. Other enzyme and substrate representations were also evaluated, including embeddings from the protein language models like ESM2 (esm2_t33_650M_UR50D)63 and SaProt (SaProt_650M_AF2)55, embeddings from chemical molecule language model such as Mole-BERT64, and traditional rule-based representations such as Morgan fingerprints and RDKit fingerprints.
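For reference, the three evaluation metrics can be computed as in the following dependency-free sketch. The function names are illustrative; in practice, library routines would normally be used. Spearman's coefficient is computed as Pearson's coefficient on average ranks, which handles ties.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient (PCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's correlation coefficient (SCC): PCC of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

def rmse(x, y):
    """Root-mean-square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

# A monotonic but non-linear relationship: SCC is 1 while PCC is below 1.
experimental = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 4.0, 9.0, 16.0]
print(pearson(experimental, predicted), spearman(experimental, predicted))
```

The toy example shows why SCC is reported alongside PCC: a model can rank candidates perfectly even when its predictions are not linearly related to the measurements.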

With the unbiased dataset of kcat, CataPro achieved PCC, SCC, and RMSE values of 0.497, 0.495, and 1.329, respectively, which are significantly better than those of DLKcat and UniKP (Fig. 2a–c and Supplementary Fig. 4a). For the Km prediction, the performance of most models was comparable (Fig. 2d–f and Supplementary Fig. 4b). The PCC, SCC, and RMSE achieved by CataPro are 0.633, 0.629, and 0.998, respectively, slightly better than UniKP. These results indicate that predicting kcat is more challenging than predicting Km. This is because kcat represents the maximum number of catalytic reactions an enzyme can catalyze per unit time, encompassing the entire catalytic process, including transition state formation, overcoming energy barriers, and product release, while Km is an inherent property of the enzyme towards the substrate, related to the strength of the enzyme-substrate binding. This further demonstrates that processes involving complex reaction mechanisms are indeed challenging to model directly using only enzyme sequences and substrate structures.

Fig. 2: Performance comparison of various kcat and Km models on 10-fold unbiased datasets.

a, b, and c show the PCC, SCC, and RMSE achieved by the kcat prediction models on the kcat dataset. d, e, and f show the PCC, SCC, and RMSE achieved by the Km prediction models on the Km dataset. It is worth noting that all models, including DLKcat and UniKP, were trained (or re-trained) on the 10-fold unbiased datasets for kcat and Km. In panels a–f, CataPro and CataPro+SaProt are highlighted in red, while other models are represented in blue. g Workflow of Extra Tree-based CataPro. h shows the performance of CataPro (Extra Tree) and UniKP on unbiased and biased datasets. The top two subplots in panel h display the results for the kcat dataset, while the bottom two subplots show the results for the Km dataset. In the four subplots of panel h, the performance of models on the unbiased datasets is highlighted in red, while the performance on the biased datasets is represented in blue. Source data are provided as a Source Data file.

In addition, an interesting phenomenon is that embeddings from ProtT557 are more effective than those from ESM263 in kcat prediction tasks. In terms of substrate representation, MolT5 embeddings, Mole-BERT embeddings, and MACCS keys fingerprints exhibit similar effects. Furthermore, we explored the effect of protein structural features on enzyme kinetic parameter prediction. SaProt55, trained on protein 3D structures, was employed to derive structural characteristics from enzyme structures predicted by AlphaFold2. After incorporating structural features, the model performed slightly worse in the kcat prediction task, although its RMSE decreased from 1.329 to 1.317. In the Km prediction task, however, structural features did not have a significant impact. Another study has also reported the ineffectiveness of structural features in predicting enzyme kinetic parameters65. This implies there is still considerable room for improvement in structure-based protein pre-trained models. Finally, we separately present the results achieved by models for kcat and Km prediction on wild-type and mutant enzymes, demonstrating that CataPro still maintains substantial advantages over DLKcat and UniKP (Supplementary Figs. 5–6). A recent study reported a model called ProSmith for Km prediction, which achieved a coefficient of determination (R2) of 0.41 on a subset where the maximum enzyme sequence identity with the training set was below 30%, demonstrating a strong generalization ability40. The similar performance of ProSmith and CataPro in Km prediction may be attributed to both models utilizing transformer-based pretrained models to capture enzyme-substrate interactions, as well as the same data sources (BRENDA and SABIO-RK) for the datasets.

Han et al.43 state that extra trees are more effective than neural networks in predicting enzyme kinetic parameters. We tested CataPro on the dataset collected by Li et al.41 and found that the extra tree method indeed performed better, especially for the UniKP model (Supplementary Fig. 7). We also conducted additional testing of CataPro and other models on randomly split ten-fold cross-validation kcat and Km datasets we developed. For this assessment, ten subsets of each dataset contained homologous enzyme data, a scenario similar to most testing setups reported in the literature41,43,44. The results demonstrate that UniKP outperformed most models in this testing scenario, especially in the kcat prediction task (Supplementary Fig. 8). Consequently, we employed the extra trees algorithm from the scikit-learn package to learn the relationship between the features of CataPro and experimental kcat and Km values (Fig. 2g). As anticipated, CataPro trained with extra trees achieved optimal results (Supplementary Fig. 8). We further compared the performance of CataPro (extra tree) and UniKP on unbiased and biased datasets, and found that these two data partitioning methods result in noteworthy differences in model performance, while the two models perform similarly when evaluated on the same dataset (Fig. 2h). This further suggests that random dataset splitting can lead to overly optimistic performance estimates, as enzymes encountered in real-world applications are often unfamiliar to the model. Therefore, the primary advantage of our proposed CataPro lies in utilizing unbiased datasets for training and evaluation, rather than relying on a more sophisticated model architecture.

Given that many enzymatic reactions may involve multiple substrates, we examined the applicability of CataPro to multi-substrate reactions. Recently, Kroll et al. curated a high-quality kcat dataset for wild-type enzyme, which includes enzyme sequences, substrate IDs, reaction equations, and kcat measurements. For each reaction in the test set, we used CataPro to predict the kcat values for the enzyme in combination with each substrate, and the average of these kcat values was taken as the final predicted kcat value for the reaction. CataPro achieved a PCC of 0.661 and an R2 of 0.415 on this test set, which is very close to TurNuP’s performance with a PCC of 0.67 and an R2 of 0.44. This suggests that CataPro also has the potential to handle multi-substrate reactions. Following the method of Kroll et al., we partitioned the test set into four subsets based on the maximal sequence identity to enzymes in the CataPro kcat dataset. The maximum sequence identity intervals for these subsets were 0−40%, 40−80%, 80−99%, and 99−100%. The performance of CataPro in each subset is presented in detail in the Supplementary Fig. 9a. Notably, in the 0−40% identity subset, CataPro achieved a PCC of 0.410 and an R2 of 0.133, showing a considerable gap compared to TurNuP’s R2 of 0.33 in this subset. This suggests that CataPro may still face challenges when dealing with unfamiliar reactions involving multiple substrates, underscoring the importance of evaluating the model in a broader range of application scenarios to avoid overly optimistic results that may arise from relying solely on the 10-fold cross-validation dataset. Additionally, we retrained CataPro on the training set curated by Kroll et al., which contains 3,421 kcat data points, to further explore the potential of the CataPro architecture for tasks involving multi-substrate reactions. For reactions with multiple substrates, the final substrate features were represented by the average of the features of all substrates. 
We trained CataPro using 5-fold cross-validation, consistent with the folds used by Kroll et al. During inference, the final prediction was obtained by averaging the predictions from the five models. On the TurNuP test set, the retrained CataPro achieved a PCC of 0.672, an R2 of 0.451, and an MSE of 0.783, which are highly consistent with TurNuP’s PCC of 0.67, R2 of 0.44, and MSE of 0.81. In the four subsets divided based on the maximum sequence identity to enzyme sequences in the training set, the retrained CataPro also achieved performance similar to TurNuP, with an R2 of 0.333 on the 0−40% subset (Supplementary Fig. 9b). This suggests that when the CataPro architecture is specifically trained on datasets involving multi-substrate reactions, it can achieve performance comparable to the current state-of-the-art (SOTA) models. This provides insights for the future development of enzyme kinetic parameter prediction models utilizing multi-substrate inputs.
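The two multi-substrate strategies described above, averaging per-substrate predictions and averaging substrate features before prediction, can be sketched as follows. The function names and the stand-in predictor are hypothetical, and whether the averaging is done on raw or log-transformed values is not stated in the text, so the sketch simply averages whatever values it is given.

```python
def predict_reaction_kcat(enzyme_seq, substrate_smiles, predict_pair):
    """Average single-substrate predictions over all substrates of a reaction.

    predict_pair is any callable (enzyme_seq, smiles) -> predicted value;
    here it is a placeholder for a trained single-substrate model.
    """
    preds = [predict_pair(enzyme_seq, smi) for smi in substrate_smiles]
    return sum(preds) / len(preds)

def average_substrate_features(feature_vectors):
    """Element-wise mean of per-substrate feature vectors (retraining variant)."""
    n = len(feature_vectors)
    return [sum(column) / n for column in zip(*feature_vectors)]

# Stand-in predictor for illustration only: returns the SMILES string length.
fake_model = lambda seq, smi: float(len(smi))
print(predict_reaction_kcat("MKT", ["CC", "CCCC"], fake_model))
print(average_substrate_features([[1.0, 2.0], [3.0, 6.0]]))
```

The same averaging pattern applies to the five cross-validation models at inference time, with the list of substrates replaced by the list of models.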

Transfer learning improves kcat/Km prediction

In this study, the kcat/Km dataset consists of enzyme-substrate pairs containing kcat and Km values, and the distribution of samples across various kcat/Km ranges is shown in Fig. 3a. These samples are stratified into ten folds based on a sequence similarity threshold of 0.4, with label distributions illustrated in Fig. 3b.

Fig. 3: Performance comparison of kcat/Km models.

a Distribution of kcat/Km values in the kcat/Km dataset. The kcat/Km dataset consists of samples with concurrent kcat and Km entries, which are divided into ten groups based on an enzyme sequence similarity threshold of 0.4. b Distribution of experimental kcat/Km values in each fold of the ten-fold unbiased dataset is shown in different colors, with the white line in the body of the violin plot representing the median. Fold 0 contains 2,584 data points, while each of the other nine folds contains 2,583 data points. c Performance of CataPro on the ten-fold unbiased dataset of kcat/Km, with the colorbar representing data density. d, e, and f respectively show the PCC, SCC, and RMSE achieved by the kcat/Km prediction models. CataPro is highlighted in red in panels d–f. Source data are provided as a Source Data file.

Although kcat/Km can be directly calculated from the predicted kcat and Km, this may not be reasonable because errors in the predicted kcat and Km values can compound in the resulting kcat/Km values. An alternative approach is to train a model directly on kcat/Km values, as exemplified by UniKP. However, this neglects the relationship between the kcat/Km value and the individual kcat and Km values. To overcome these challenges, CataPro adopts an architecture that integrates pre-trained models for kcat and Km prediction with a neural network-based correction term to predict kcat/Km. On the unbiased datasets of kcat/Km, CataPro achieved a PCC of 0.413, SCC of 0.416, and RMSE of 1.619, significantly outperforming DLKcat and UniKP (Fig. 3c–f). The ProtT5_molT5-MACC shown in Fig. 3d–f demonstrates the performance when directly using the architecture shown in Fig. 1b to train kcat/Km. Comparison between CataPro and ProtT5_molT5-MACC demonstrates that the architecture implemented in CataPro results in an increase in PCC from 0.392 to 0.413 and a decrease in RMSE from 1.64 to 1.619. This showcases the effectiveness of the feature combination and architecture adopted by CataPro in predicting kcat/Km. The models were also tested on the randomly partitioned ten-fold cross-validation datasets of kcat/Km, confirming the same conclusions as observed in the previous kcat and Km tests (Supplementary Fig. 10).

CataPro enhances the ranking ability for mutations

In industrial production, optimizing enzymatic catalytic efficiency is crucial for reducing production costs. However, the inherent activity of natural enzymes often falls short of the requirements for high catalytic performance. Consequently, enhancing the activity of natural enzymes through modification has consistently been a primary objective within the enzyme industry. Despite the widespread acknowledgment of this challenge, recent research on predicting kinetic parameters has rarely provided a systematic examination of the models’ capacity to differentiate between mutants in specific catalytic reactions. This study introduces an evaluation metric aimed at assessing the efficacy of enzyme kinetic parameter prediction models in the context of enzyme engineering, underscoring the importance of accurately predicting mutation effects for successful enzyme optimization.

We first selected reactions from each dataset (kcat, Km, and kcat/Km datasets) where the number of mutants, including the wild-type, exceeded a certain minimum threshold N. The criterion for defining the same reaction was based on identical UniProtID-SMILES pairs for enzyme and substrate. During training on the ten-fold cross-validation datasets, predictions for the wild-type and mutants of selected reactions in the validation set were used for evaluation. Based on the distribution of the number of mutants within a single reaction in the current datasets, we evaluated the cases of N being 20 and 30, respectively, with the corresponding number of reactions shown in Supplementary Figs. 11a–c. Most mutants involved in reactions with N ≥ 20 and N ≥ 30 across the three datasets are disadvantageous, showing worse effects than the wild-type (Supplementary Figs. 11d−f). For each reaction, the SCC between predicted and experimental values for all mutants was calculated as a measure of the model’s ability to rank mutants within that reaction. The average SCC value achieved by the model across all reactions reflects its overall performance in ranking mutants. Figure 4a–c demonstrates that CataPro exhibits superior ranking ability compared to UniKP, particularly in the kcat and kcat/Km datasets (Supplementary Table 1). We also tested the accuracy of the models in selecting the more favorable mutant between any two mutants (including the wild-type) within the same reaction. If a reaction contains N mutants (including the wild-type), there are a total of N(N-1)/2 comparisons. According to Fig. 4d–f and Supplementary Table 2, CataPro still shows a clear overall advantage over UniKP and DLKcat.
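The pairwise comparison test described above can be sketched as follows. The function name `pairwise_accuracy` is illustrative, and skipping pairs with identical experimental values is an assumption, since the text does not say how ties are treated.

```python
from itertools import combinations

def pairwise_accuracy(experimental, predicted):
    """Fraction of variant pairs whose predicted order matches the measured order.

    Both lists cover the wild-type and all mutants of one reaction, in the
    same order. Pairs with equal experimental values are skipped (assumption).
    """
    correct = total = 0
    for i, j in combinations(range(len(experimental)), 2):
        d_exp = experimental[i] - experimental[j]
        d_pred = predicted[i] - predicted[j]
        if d_exp == 0:
            continue
        total += 1
        if d_exp * d_pred > 0:  # same sign means the same variant is favored
            correct += 1
    return correct / total if total else float("nan")

# Invented toy reaction with N = 4 variants, giving 4 * 3 / 2 = 6 comparisons.
exp_values = [1.0, 2.0, 3.0, 4.0]
pred_values = [1.0, 3.0, 2.0, 4.0]
print(pairwise_accuracy(exp_values, pred_values))
```

In the toy example only the second and third variants are swapped, so five of the six comparisons are correct.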

Fig. 4: Performance comparison of CataPro and UniKP in enzyme engineering scenarios.

a–c illustrate the models’ ability to rank mutants within a specific reaction on the kcat, Km, and kcat/Km datasets. To ensure statistical significance of the results, only enzyme-substrate reactions with a number of mutants (including the wild-type) exceeding N (where N is 20 or 30) in each dataset were used for this test. d–f show the accuracy of the models in identifying the better-performing mutants from any two mutants across these three datasets. The box plots depict the distribution of metrics achieved by the model across all reactions where the number of mutants is greater than or equal to N. In each box plot, the lower and upper boundaries of the box represent the first quartile (Q1) and the third quartile (Q3), respectively. The whiskers extend from the quartiles to the minimum and maximum values within 1.5 times the interquartile range. The white circle represents the mean value of each statistic and the black line inside the box represents the median. In the kcat, Km, and kcat/Km datasets, the numbers of reactions with N ≥ 20 mutants (including the wild-type) are 39, 57, and 30, respectively, while the numbers of reactions with N ≥ 30 are 10, 16, and 7, respectively. Source data are provided as a Source Data file.

Additionally, we adopted the method proposed by Kroll et al. to evaluate the models’ ability to predict mutation effects66. Specifically, for each enzyme-substrate pair with N ≥ 20 and N ≥ 30, the mean of the measured values (kcat, Km, or kcat/Km) for the wild-type and all mutants was first calculated. This mean was then subtracted from the experimental and predicted values of each mutant and the wild-type in the reaction to obtain the experimental mutation effect and the predicted mutation effect, respectively. Supplementary Figs. 12a−c illustrate that CataPro exhibits excellent correlation between predicted and experimental mutation effects for certain reactions, despite the training set containing no enzymes with more than 40% similarity to the enzyme. Conversely, for some reactions, CataPro demonstrates weaker correlations (Supplementary Figs. 12d−f). Notably, even for reactions with high correlation, CataPro demonstrates limitations in accurately predicting the absolute values of mutation effects. To perform a comprehensive evaluation of the model’s ability to predict mutation effects, we concatenated the predicted and experimental mutation effects across all reactions. Unfortunately, in this test, none of CataPro, UniKP, or DLKcat successfully predicted the mutation effects (Supplementary Table 3). This may indicate that precisely capturing the effects of mutations remains a considerable challenge.
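The mean-centering step used to define mutation effects can be sketched as follows. The function names and toy numbers are illustrative. Following the wording above, the mean of the measured values is subtracted from both the experimental and the predicted values of a reaction.

```python
def mutation_effects(measured, predicted):
    """Center both value lists on the mean of the measured values of one reaction."""
    mean = sum(measured) / len(measured)
    exp_effect = [v - mean for v in measured]
    pred_effect = [v - mean for v in predicted]
    return exp_effect, pred_effect

def pooled_effects(reactions):
    """Concatenate effects over reactions given as (measured, predicted) pairs."""
    all_exp, all_pred = [], []
    for measured, predicted in reactions:
        exp_effect, pred_effect = mutation_effects(measured, predicted)
        all_exp.extend(exp_effect)
        all_pred.extend(pred_effect)
    return all_exp, all_pred

# Two invented reactions with very different baselines.
reactions = [([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]),
             ([10.0, 12.0], [11.0, 11.5])]
pooled_exp, pooled_pred = pooled_effects(reactions)
print(pooled_exp, pooled_pred)
```

Centering removes the reaction-level baseline, so a correlation computed on the pooled effects reflects only how well variation among variants of the same reaction is captured.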

Performance of CataPro on external test datasets from diverse sources

To further evaluate the potential of our model in enzyme mining and engineering, we collected four experimentally measured datasets from previously reported studies as additional external test sets. The first two test sets are the Tyrosine ammonia-lyase (TAL) homologue dataset and the TAL engineering dataset, collected from UniKP43, both containing experimentally measured kcat/Km values. TAL enzymes are utilized in the production of aromatic compounds, such as flavonoids, cinnamoyl anthranilates, or plastic precursors67, and identifying their high-activity alternative enzymes and mutants has garnered considerable interest from the community. In a recent study, MPEK also used these two datasets for testing44. In the TAL homologue dataset, IsTAL has the lowest experimental kcat/Km value. Taking IsTAL as a reference, if the model predicts higher kcat/Km values for other enzymes than that of IsTAL, the prediction is considered correct (marked in green in Fig. 5a); otherwise, it is considered incorrect (marked in orange in Fig. 5a). It should be noted that in these two test datasets, except for a five-site mutant of RgTAL present in our kcat/Km dataset, the maximum sequence similarities of all other test sequences to those in the kcat/Km dataset were less than 0.67. The results in Fig. 5a indicate that CataPro has a 50% success rate in identifying TAL enzymes with higher activity than IsTAL. The TAL engineering dataset includes the wild-type RgTAL and its nine mutants. Among them, MT-587V, MT-10Y, and MT-489T exhibit enhanced catalytic efficiency compared to the wild-type, whereas the remaining five mutants demonstrate diminished catalytic efficiency. The results depicted in Fig. 5b illustrate that successful predictions were made for eight out of nine mutants, using the wild-type as a reference.
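The reference-based scoring used for these small datasets can be sketched as follows. A prediction counts as correct when the model places a variant on the same side of the reference as the experiment does. The function name and all numbers are invented for illustration and are not taken from the TAL datasets.

```python
def reference_success_rate(experimental, predicted, reference):
    """Fraction of non-reference entries ranked correctly against the reference.

    experimental and predicted map entry names to values (e.g. kcat/Km);
    reference is the name of the reference entry (e.g. a wild-type enzyme).
    """
    others = [name for name in experimental if name != reference]
    hits = 0
    for name in others:
        exp_higher = experimental[name] > experimental[reference]
        pred_higher = predicted[name] > predicted[reference]
        if exp_higher == pred_higher:
            hits += 1
    return hits / len(others)

# Invented toy values: A and C are judged correctly, B and D are not.
exp_vals = {"ref": 1.0, "A": 2.0, "B": 3.0, "C": 0.5, "D": 4.0}
pred_vals = {"ref": 1.0, "A": 1.5, "B": 0.8, "C": 0.2, "D": 0.9}
print(reference_success_rate(exp_vals, pred_vals, "ref"))
```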

Fig. 5: Performance of CataPro on the small datasets measured in previously reported experiments.

a-d Predicted kcat/Km values on the TAL homologue dataset, the TAL engineering dataset, the DERA dataset, and the BH1352 dataset, respectively. The black dots represent reference, while the green and orange dots denote correctly and incorrectly predicted samples, respectively. In other words, if CataPro successfully predicts that the activity of a specific sequence or mutant is higher or lower than that of the reference, it is marked in green; otherwise, it is marked in orange. Source data are provided as a Source Data file.

The other two small datasets involve the activity of D-2-deoxyribose-5-phosphate aldolase (DERA) from Escherichia coli and BH1352 from Bacillus halodurans in catalyzing the D-2-deoxyribose-5-phosphate (DRP) reaction68,69. Catalyzing the DRP reaction is a crucial biocatalytic step in the conversion of renewable raw materials into valuable chemicals such as non-natural diol-1,3-BDO, which is used in the synthesis of polymers, pheromones, fragrances, insecticides, and antibiotics. In the DERA dataset, all mutants exhibited lower activity than the wild-type. Interestingly, CataPro was able to identify the decrease in enzyme activity resulting from these mutations (Fig. 5c), despite the presence of five reaction data points for DERA catalyzing other substrates in the kcat/Km dataset. The BH1352 dataset consists of nine single-point mutants and one double-point mutant, with the BH1352 enzyme exhibiting a maximum sequence identity of only 0.54 to the enzymes in our kcat/Km dataset. With the BH1352 dataset, CataPro achieved an 80% prediction success rate (Fig. 5d). These results indicate that CataPro exhibits robust performance in datasets from diverse sources, once again confirming its strong potential for practical applications.

Performance of CataPro on deep mutation scanning datasets

Traditional experimental methods usually assess only dozens of mutations at a time, which is trivial compared to the entire mutational landscape of the protein. Deep mutational scanning (DMS), an approach that integrates genotype and phenotype, enables the evaluation of fitness in up to one million protein variants in a single experiment58. While enzyme kinetic parameters are not fully correlated with fitness, which is influenced by diverse factors such as stability, folding efficiency, catalytic activity, binding specificity, and affinity70, evaluating enzyme kinetic parameter models on the genotype-phenotype datasets remains meaningful.

To evaluate the performance of the predictors on mutants across a broader range of sites, we collected DMS data for four enzymes (with their corresponding substrates) for testing. The first dataset is the EcTL dataset, which includes the TEM-1 β-lactamase expressed in Escherichia coli (Fig. 6a). This enzyme hydrolyzes β-lactam antibiotics, such as ampicillin, thus conferring antibiotic resistance to bacteria71. This dataset contains fitness data for 5,468 mutants across 286 sites. The other three datasets involve SsIGPS from Sulfolobus solfataricus, TmIGPS from Thermotoga maritima, and TtIGPS from Thermus thermophilus, which are indole-3-glycerol phosphate synthases (IGPS) from diverse organisms72 (Fig. 6b–d). Despite their similar functions, these IGPS enzymes exhibit low sequence similarity. Each of these IGPS enzymes has fitness data for approximately 1,513 mutants involving 80 sites.

Fig. 6: Performance of CataPro and baseline models on the DMS datasets.

a–d respectively show the 3D structures of the EcTL, TmIGPS, TtIGPS, and SsIGPS enzymes, as generated by AlphaFold2. Residues are colored based on the proportion of advantageous mutants at each site (blue-white-red). A redder color indicates a higher proportion of advantageous mutants, while a bluer color indicates a lower proportion. Residues lacking fitness data are also marked in blue. e–h respectively show the SCCs achieved by predictors on the EcTL, TmIGPS, TtIGPS, and SsIGPS datasets. Km values and PSSM scores are inversely related to activity. For clarity in comparison, the bar chart therefore shows the negative of the original SCC achieved by Km values from CataPro and UniKP, as well as by PSSM scores. UniKP and DLKcat correspond to the versions reported in the original papers. Source data are provided as a Source Data file.

We evaluated the performance of CataPro, UniKP, and DLKcat on these four DMS datasets, as these datasets encompass mutations at various sites either proximal or distal to the catalytic sites. This extensive mutation space better reveals the real performance of the models. In addition, the Position-Specific Scoring Matrix (PSSM) was used as an additional mutation-function prediction baseline. PSSM incorporates protein sequence co-evolution information, with each element representing the frequency of a specific amino acid occurring at a particular position across different homologous sequences. The conservation and mutation information encapsulated in the PSSM are frequently utilized as descriptors for protein engineering73. Interestingly, the SCCs achieved by the kcat and kcat/Km models of CataPro either approach or surpass those obtained using PSSM scores (Fig. 6e–h). For UniKP and DLKcat, the performance of the kcat model of UniKP is relatively better, particularly on the EcTL dataset, where its performance approaches that of PSSM. To ensure the reliability of CataPro’s results on these four DMS datasets, we calculated the maximum sequence identity between the enzyme sequences in these four test sets and those in our kcat, Km, and kcat/Km datasets. With the exception of TtIGPS, which has a maximum sequence identity of 0.48, the other three enzymes are included in our dataset. Since our ten-fold cross-validation datasets are clustered based on a protein sequence similarity threshold of 0.4, each enzyme and its similar enzyme sequences are only present in the same subset. Supplementary Table 4 presents the performance of all CataPro sub-models for kcat, Km, and kcat/Km on these four DMS datasets. In each DMS dataset, the performance of models trained on subsets where the test enzyme is not included is highlighted in bold in Supplementary Table 4.
Interestingly, models that are unfamiliar with the test enzyme are not necessarily worse than those models where the test enzyme is included in the training set. Notably, in the EcTL dataset, the model that had not encountered the EcTL enzyme achieved a SCC of 0.437, ranking second among all the sub-models. These results highlight the robustness of CataPro for protein fitness landscape prediction.

CataPro-assisted enzyme mining and directed evolution

In previous assessments, CataPro exhibited notable superiority over the other models. To determine how it performs in real-world applications, we applied CataPro to discovering enzymes and identifying high activity mutations for vanillin biosynthesis.

Vanillin is an important aromatic aldehyde with a rich milky and vanilla aroma, widely used in industries such as food, beverages, cosmetics, and pharmaceuticals74,75,76,77. Although vanillin is found in many plants, such as vanilla bean pods, its natural production is very low because of the stringent cultivation requirements of these plants, resulting in naturally derived vanillin accounting for less than 1% of the vanillin sold on the market. Over 99% of the vanillin on the market is chemically synthesized using eugenol and guaiacol as substrates77. This method is cost-effective but involves high energy consumption and substantial pollution. In contrast, producing vanillin through biotransformation processes using agricultural residues, lignin, and ferulic acid as substrates complies with food safety requirements set by the United States and the European Union, which stipulate that vanillin synthesized from ferulic acid can be considered natural-equivalent vanillin. A synthetic pathway with strong industrial application potential involves the decarboxylation of ferulic acid by ferulic acid decarboxylase (FDC) to produce 4-VG, followed by oxidation by Caulobacter segnis carotenoid cleavage oxygenases (CSO2)75,78. Here, we demonstrate the high potential of CataPro to assist in the discovery of highly active oxygenases and mutants for vanillin biosynthesis.

To discover more potent 4-VG oxidation enzymes, we employed BLAST to retrieve 1500 sequences with sequence similarity higher than 0.2 to the CSO2 sequence from the UniProt database60. Subsequently, sequences with a length difference higher than 20 amino acids compared to the CSO2 sequence were removed. The remaining sequences were screened based on the Km and kcat values predicted by CataPro, and the top 150 enzymes were retained. As protein function depends on its structure, and structural-based clustering is a useful strategy for enzyme discovery19, we calculated the structural similarity between these enzymes and CSO2 using TM-Align79. The top half of the enzymes were clustered through t-SNE to produce the final five representative enzymes (Fig. 7a). These five enzymes, PpCSO, MgpCSO, PgCSO, SsCSO, and TkCSO, are derived from Pseudomonas putida, marine gamma proteobacterium, Pseudomonas gingeri, Sphingobium sp., and Trebonia kvetii, respectively. Among them, only PpCSO is included in the CataPro dataset, while the maximum sequence identity between the other four enzymes and those in our dataset is below 44%. We measured the activity of these five enzymes through experiments and found that the activity of SsCSO was 19.53 times higher than the reference enzyme CSO2 (Fig. 7c).
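The first two filtering steps of this mining workflow can be sketched as follows. The function names, the toy sequences, and the single ranking score are assumptions: the text states only that sequences differing in length from CSO2 by more than 20 residues were removed and that the top 150 candidates were retained based on predicted Km and kcat, without giving the exact ranking rule.

```python
def filter_by_length(candidates, reference_len, max_diff=20):
    """Keep sequences whose length differs from the reference by at most max_diff."""
    return {name: seq for name, seq in candidates.items()
            if abs(len(seq) - reference_len) <= max_diff}

def top_k(scores, k):
    """Names of the k highest-scoring candidates (scoring rule is hypothetical)."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Invented toy candidates; real input would be BLAST hits against CSO2.
candidates = {"a": "M" * 100, "b": "M" * 125, "c": "M" * 80, "d": "M" * 79}
kept = filter_by_length(candidates, reference_len=100)
print(sorted(kept))

# Invented scores standing in for a ranking derived from model predictions.
print(top_k({"a": 0.3, "c": 0.9}, k=1))
```

The later steps (structural similarity with TM-align and clustering with t-SNE) depend on external tools and are not reproduced here.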

Fig. 7: Results of CataPro in the mining and engineering of Carotenoid Cleavage Oxygenases.

a and b illustrate the workflows for enzyme discovery and engineering, respectively. c The bar group on the left shows the relative activity of the mined candidate enzymes compared to CSO2. The middle and right bar groups respectively show the relative activity of SsCSO mutants compared to CSO2 in two rounds of enzyme engineering. The top subplot illustrates the reaction of CSO enzymes catalyzing 4-VG to produce vanillin. All activity values are relative to CSO2, with CSO2 activity set to 1. Source data are provided as a Source Data file. d visualizes the locations of several experimentally validated beneficial mutation sites in the SsCSO. Orange spheres denote the mutated residues, the red sphere represents the iron atom, and the pink stick represents 4-VG.

To validate the efficacy of CataPro in enzyme engineering, we utilized CataPro for computationally driven directed evolution of SsCSO. We applied CataPro and position-specific scoring matrix (PSSM) scores to select mutants, ensuring that they exhibit high activity and evolutionary conservation simultaneously80,81,82. In practical enzyme engineering, maintaining or enhancing the structural stability of enzymes is equally important83,84. To ensure that mutations do not compromise enzyme stability, we adopted PSSM as a key criterion to limit potential changes to the protein structure and to narrow the vast mutation sampling space. This strategy has been successfully applied in many enzyme activity and selectivity improvement scenarios81,82,85. Residues situated within the binding pocket and directly interacting with the substrate are often targets for mutation, as they directly affect the enzyme function and activity83,86. Here, we first employed AlphaFold222 to predict the SsCSO structure, followed by molecular docking87 to simulate the enzyme-substrate complex structure. Ninety-one residues within a distance of 12 Å from the substrate (excluding histidines in contact with the iron ion) were selected as mutation sites. Each site was mutated to the other 19 standard amino acids, resulting in a total of 1729 (19 × 91 = 1729) single-point mutants. The top half of the mutants, ranked by the predicted kcat/Km values from CataPro, were retained. Subsequently, we assessed the evolutionary conservation of these mutants using PSSM. If the amino acid at site n mutates from i to j, the change in PSSM score caused by this mutation is defined as \(\Delta \mathrm{PSSM}_{ij}^{n} = \mathrm{PSSM}_{j}^{n} - \mathrm{PSSM}_{i}^{n}\). A larger \(\Delta \mathrm{PSSM}_{ij}^{n}\) indicates that the mutation is more consistent with evolutionary trends. Six mutants with \(\Delta \mathrm{PSSM}_{ij}^{n}\) greater than 7 were selected for experimental validation (Fig. 7b). Among these mutants, T216M and M351F exhibit higher activity compared to the wild-type (Fig. 7c).
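The selection procedure above (enumerate all single-point mutants, keep the top half by predicted kcat/Km, then filter on the PSSM change) can be sketched as follows. This is an illustrative sketch under stated assumptions: `predict` stands in for a CataPro kcat/Km prediction and is a hypothetical callable, `pssm` is assumed to be a plain mapping from 1-based position to per-residue scores, and the PSSM change is taken as the score of the mutant residue minus the score of the wild-type residue at the same site.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def enumerate_single_mutants(sequence, positions):
    """Yield (wild_type, position, mutant) for every 1-based site and each of the other 19 residues."""
    for pos in positions:
        wt = sequence[pos - 1]
        for aa in AMINO_ACIDS:
            if aa != wt:
                yield (wt, pos, aa)


def delta_pssm(pssm, pos, wt, mut):
    """PSSM change at site pos: score of the mutant residue minus score of the wild-type residue."""
    return pssm[pos][mut] - pssm[pos][wt]


def select_mutants(sequence, positions, predict, pssm, threshold=7):
    """Keep the top half by predicted activity, then require a PSSM change above threshold."""
    mutants = list(enumerate_single_mutants(sequence, positions))
    # `predict` is a hypothetical scoring function; higher means more active.
    ranked = sorted(mutants, key=predict, reverse=True)
    top_half = ranked[: len(ranked) // 2]
    return [m for m in top_half if delta_pssm(pssm, m[1], m[0], m[2]) > threshold]
```

With 91 sites, `enumerate_single_mutants` produces 19 × 91 = 1729 candidates, matching the library size quoted in the text.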

To further enhance the activity of SsCSO, we conducted the next round of mutations based on the combined mutant T216M-M351F. We selected 22 residues in the loop region of the SsCSO pocket (excluding residues already mutated in the first round) as new mutation sites, resulting in a total of 418 (19 × 22 = 418) mutants. By combining the predicted kcat/Km values from CataPro and the PSSM scores, six mutants with T216M-M351F as the template were selected for experimental validation (Fig. 7b). The results demonstrated that the mutants T216M-M351F-Q100G and T216M-M351F-V384G exhibited significantly higher activity, being 3.16-fold and 3.34-fold higher than that of the wild-type SsCSO, respectively. When compared to CSO2, the activity improvement is 61.71-fold and 65.23-fold for SsCSO T216M-M351F-Q100G and T216M-M351F-V384G, respectively (Fig. 7c). The relative activity values of all candidate enzymes discovered during enzyme mining and SsCSO mutants generated through enzyme engineering are presented in Supplementary Table 5. Figure 7d illustrates the locations of the final dominant mutation sites. These experiments demonstrate that CataPro is an effective method for enzyme discovery and engineering.
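As a quick consistency check on the reported numbers, the fold-changes relative to CSO2 can be reproduced by multiplying the relative activity of wild-type SsCSO (19.53 versus CSO2) by each mutant's fold-change over wild-type SsCSO. The helper name `fold_vs_reference` is hypothetical, and multiplicative composition of relative activities is an assumption that the reported values happen to satisfy.

```python
def fold_vs_reference(parent_vs_reference, mutant_vs_parent):
    """Compose two relative activities, assuming they combine multiplicatively."""
    return parent_vs_reference * mutant_vs_parent


SSCSO_VS_CSO2 = 19.53  # wild-type SsCSO relative to CSO2, as reported in the text

for name, vs_wild_type in [("T216M-M351F-Q100G", 3.16), ("T216M-M351F-V384G", 3.34)]:
    print(name, round(fold_vs_reference(SSCSO_VS_CSO2, vs_wild_type), 2))
# prints:
# T216M-M351F-Q100G 61.71
# T216M-M351F-V384G 65.23

# Second-round library size: 19 substitutions at each of 22 sites.
print(19 * 22)  # prints 418
```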



