Science News

Robust enzyme discovery and engineering with deep learning using CataPro

March 20, 2025

Overview of CataPro

During the dataset preparation phase, we first created the initial k_cat and K_m datasets by collecting all k_cat and K_m entries, respectively, from the BRENDA³² and SABIO-RK³³ databases. Details of the k_cat and K_m data collection and processing can be found in Supplementary Figs. 1–2, and the analysis of processed k_cat and K_m data is shown in Supplementary Fig. 3. Samples containing both experimental k_cat and K_m values were extracted to construct the k_cat/K_m dataset. To establish a fair benchmark for evaluating the accuracy and generalization capability of enzyme kinetic parameter prediction models, we applied CD-HIT⁵⁹ based sequence clustering (sequence similarity cutoff is 0.4) within each dataset to form ten enzyme groups. Enzyme sequences and canonical SMILES of substrates were collected from UniProt⁶⁰ and PubChem⁶¹, respectively. Thus, the unbiased ten-fold cross-validation datasets for k_cat, K_m, and k_cat/K_m were created (Fig. 1a).
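The idea behind the unbiased split can be sketched as follows: once sequences are clustered, whole clusters (never individual samples) are assigned to folds, so that similar enzymes never straddle a training/validation boundary. The text does not say how clusters were distributed over the ten groups; the greedy size balancing below, the function name `assign_folds`, and the toy cluster labels are all assumptions made for illustration.

```python
from collections import defaultdict

def assign_folds(cluster_of, n_folds=10):
    """Assign whole sequence clusters to folds, largest cluster first.

    cluster_of maps a sample id to its cluster id (e.g. parsed from a
    clustering tool's output). Returns a dict: sample id -> fold index.
    """
    members = defaultdict(list)
    for sample, cluster in cluster_of.items():
        members[cluster].append(sample)

    fold_sizes = [0] * n_folds
    fold_of = {}
    # Greedy balancing (an assumption): each cluster goes to the smallest fold.
    for cluster in sorted(members, key=lambda c: (-len(members[c]), str(c))):
        target = fold_sizes.index(min(fold_sizes))
        for sample in members[cluster]:
            fold_of[sample] = target
        fold_sizes[target] += len(members[cluster])
    return fold_of

# Toy example: six samples in three clusters, split into two folds.
toy = {"s1": "A", "s2": "A", "s3": "A", "s4": "B", "s5": "B", "s6": "C"}
folds = assign_folds(toy, n_folds=2)
```

Because a cluster is never split, every validation enzyme is below the clustering similarity cutoff with respect to every training enzyme, which is what separates this setup from a random split.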

The CataPro model is a neural network-based framework for predicting enzyme kinetic parameters, using amino acid sequences and SMILES as the original representations for enzymes and substrates, respectively. Enzyme information is encoded into a vector having a length of 1024 using the ProtT5-XL-UniRef50 model (referred to as ProtT5)⁵⁷, similar to UniKP. For substrates, we employed MolT5 embeddings⁶² and MACCS keys fingerprints jointly as feature representations, having dimensions of 768 and 167, respectively. The enzyme and substrate representations are concatenated into a 1959-dimensional vector, serving as the overall information for the enzyme-catalyzed reaction. This concatenated vector is used as the input to the neural network for predicting the k_cat or K_m values.
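The dimension bookkeeping (1024 + 768 + 167 = 1959) can be checked with a minimal sketch. Random arrays stand in for the real ProtT5, MolT5, and MACCS outputs, and the helper name `build_pair_features` is hypothetical; only the three dimensions and the concatenation come from the text.

```python
import numpy as np

PROTT5_DIM, MOLT5_DIM, MACCS_DIM = 1024, 768, 167

def build_pair_features(enzyme_vec, molt5_vec, maccs_bits):
    """Concatenate enzyme and substrate representations into one vector."""
    parts = [
        np.asarray(enzyme_vec, dtype=np.float32),
        np.asarray(molt5_vec, dtype=np.float32),
        np.asarray(maccs_bits, dtype=np.float32),
    ]
    for part, dim in zip(parts, (PROTT5_DIM, MOLT5_DIM, MACCS_DIM)):
        if part.shape != (dim,):
            raise ValueError(f"expected shape ({dim},), got {part.shape}")
    return np.concatenate(parts)

# Placeholder inputs: random values instead of real embeddings/fingerprints.
rng = np.random.default_rng(0)
features = build_pair_features(
    rng.normal(size=PROTT5_DIM),
    rng.normal(size=MOLT5_DIM),
    rng.integers(0, 2, size=MACCS_DIM),
)
```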

For predicting k_cat/K_m, some researchers have employed the same architecture used for predicting k_cat and K_m⁴³. However, this approach may not achieve optimal performance, as it makes k_cat/K_m entirely independent of k_cat and K_m. To mitigate this concern, we proposed a strategy that incorporates a neural network-based correction term to enhance the accuracy of k_cat/K_m predictions. First, the k_cat and K_m models were pre-trained using the unbiased k_cat/K_m dataset with ten-fold cross-validation. Then, the initial k_cat/K_m predictions were derived by dividing the predictions from the pre-trained k_cat models by those from the pre-trained K_m models. Due to the inherent errors in the k_cat and K_m models, the initial k_cat/K_m predictions may exhibit substantial inaccuracies. Therefore, a correction model was trained to mitigate this error. The input to the correction term prediction network remains a 1959-dimensional vector representing the enzyme-substrate feature, which is transformed into a 256-dimensional vector after passing through a dense layer. Then, a feature-wise attention layer before the output layer extracts crucial information. The predicted correction, combined with the ratio of k_cat to K_m, yields the final predicted k_cat/K_m value. During the training phase of the correction factor model, the parameters of the pre-trained k_cat and K_m models are held fixed.
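A rough sketch of how such a correction term might be wired is given below. Several details are assumptions rather than statements from the text: that predictions live in log10 space (so the ratio becomes a difference), that the correction is added to that difference, that the dense layer uses ReLU, and that feature-wise attention is a softmax gate over the 256 hidden features. Random weights stand in for trained parameters, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
IN_DIM, HID_DIM = 1959, 256

# Randomly initialised weights stand in for trained parameters.
W1 = rng.normal(0.0, 0.02, (IN_DIM, HID_DIM))
b1 = np.zeros(HID_DIM)
Wa = rng.normal(0.0, 0.02, (HID_DIM, HID_DIM))
w_out = rng.normal(0.0, 0.02, HID_DIM)
b_out = 0.0

def correction_term(x):
    """Dense layer -> feature-wise attention -> scalar correction."""
    h = np.maximum(x @ W1 + b1, 0.0)       # 1959 -> 256, ReLU assumed
    scores = h @ Wa
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over the 256 features
    return float((weights * h) @ w_out + b_out)

def predict_log_kcat_over_km(log_kcat, log_km, x):
    """Combine frozen k_cat and K_m predictions with the learned correction."""
    initial = log_kcat - log_km             # division becomes subtraction in log space
    return initial + correction_term(x)

x = rng.normal(size=IN_DIM)
estimate = predict_log_kcat_over_km(2.0, -1.0, x)
```

In this arrangement only `W1`, `b1`, `Wa`, `w_out`, and `b_out` would be updated during training; `log_kcat` and `log_km` come from the frozen pre-trained models.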

Performance of CataPro in k_cat and K_m predictions

In this work, unless otherwise stated, all training and testing were conducted on our unbiased ten-fold cross-validation datasets. Evaluation metrics include Pearson’s correlation coefficient (PCC), Root-Mean-Square-Error (RMSE), and Spearman’s correlation coefficient (SCC). The inclusion of SCC is particularly important as ranking power is crucial in enzyme mining and enzyme engineering. We retrained DLKcat and UniKP on our datasets under the same testing conditions to ensure fair comparison. Other enzyme and substrate representations were also evaluated, including embeddings from the protein language models like ESM2 (esm2_t33_650M_UR50D)⁶³ and SaProt (SaProt_650M_AF2)⁵⁵, embeddings from chemical molecule language model such as Mole-BERT⁶⁴, and traditional rule-based representations such as Morgan fingerprints and RDKit fingerprints.

With the unbiased dataset of k_cat, CataPro achieved PCC, SCC, and RMSE values of 0.497, 0.495, and 1.329, respectively, which are significantly better than those of DLKcat and UniKP (Fig. 2a–c and Supplementary Fig. 4a). For the K_m prediction, the performance of most models was comparable (Fig. 2d–f and Supplementary Fig. 4b). The PCC, SCC, and RMSE achieved by CataPro are 0.633, 0.629, and 0.998, respectively, slightly better than UniKP. These results indicate that predicting k_cat is more challenging than predicting K_m. This is because k_cat represents the maximum number of catalytic reactions an enzyme can catalyze per unit time, encompassing the entire catalytic process, including transition state formation, overcoming energy barriers, and product release, while K_m is an inherent property of the enzyme towards the substrate, related to the strength of the enzyme-substrate binding. This further demonstrates that processes involving complex reaction mechanisms are indeed challenging to model directly using only enzyme sequences and substrate structures.

**Fig. 2: Performance comparison of various k_cat and K_m models on 10-fold unbiased datasets.**

In addition, an interesting phenomenon is that embeddings from ProtT5⁵⁷ are more effective than those from ESM2⁶³ in k_cat prediction tasks. In terms of substrate representation, MolT5 embeddings, Mole-BERT embeddings, and MACCS keys fingerprints exhibit similar effects. Furthermore, we explored the effect of protein structural features on enzyme kinetic parameter prediction. SaProt⁵⁵, trained on protein 3D structures, was employed to derive structural characteristics from enzyme structures predicted by AlphaFold2. After incorporating structural features, the model performed slightly worse in the k_cat prediction task, although RMSE decreased from 1.329 to 1.317. In the K_m prediction task, structural features did not have a significant impact. Another study has also reported the ineffectiveness of structural features in predicting enzyme kinetic parameters⁶⁵. This implies there is still considerable room for improvement in structure-based protein pre-trained models. Finally, we separately present the results achieved by models for k_cat and K_m prediction on wild-type and mutant enzymes, demonstrating that CataPro still maintains substantial advantages over DLKcat and UniKP (Supplementary Figs. 5–6). A recent study reported a model called ProSmith for K_m prediction, which achieved a determination coefficient (R²) of 0.41 on a subset where the maximum enzyme sequence identity with the training set was below 30%, demonstrating a strong generalization ability⁴⁰. The similar performance of ProSmith and CataPro in K_m prediction may be attributed to both models utilizing transformer-based pretrained models to capture enzyme-substrate interactions, as well as the same data sources (BRENDA and SABIO-RK) for the datasets.

Han et al.⁴³ state that extra trees are more effective than neural networks in predicting enzyme kinetic parameters. We tested CataPro on the dataset collected by Li et al.⁴¹ and found that the extra trees method indeed performed better, especially for the UniKP model (Supplementary Fig. 7). We also conducted additional testing of CataPro and other models on randomly split ten-fold cross-validation k_cat and K_m datasets we developed. For this assessment, ten subsets of each dataset contained homologous enzyme data, a scenario similar to most testing setups reported in the literature^41,43,44. The results demonstrate that UniKP outperformed most models in this testing scenario, especially in the k_cat prediction task (Supplementary Fig. 8). Consequently, we employed the extra trees algorithm from the scikit-learn package to learn the relationship between the features of CataPro and experimental k_cat and K_m values (Fig. 2g). As anticipated, CataPro trained with extra trees achieved optimal results (Supplementary Fig. 8). We further compared the performance of CataPro (extra trees) and UniKP on unbiased and biased datasets, and found that these two data partitioning methods result in noteworthy differences in model performance, while the two models perform similarly when evaluated on the same dataset (Fig. 2h). This further suggests that random dataset splitting can lead to overly optimistic performance estimates, as enzymes in real-world applications are often likely to be unfamiliar to the model. Therefore, the primary advantage of our proposed CataPro lies in utilizing unbiased datasets for training and evaluation, rather than relying on a more sophisticated model architecture.

Given that many enzymatic reactions may involve multiple substrates, we examined the applicability of CataPro to multi-substrate reactions. Recently, Kroll et al. curated a high-quality k_cat dataset for wild-type enzyme, which includes enzyme sequences, substrate IDs, reaction equations, and kcat measurements. For each reaction in the test set, we used CataPro to predict the k_cat values for the enzyme in combination with each substrate, and the average of these k_cat values was taken as the final predicted k_cat value for the reaction. CataPro achieved a PCC of 0.661 and an R² of 0.415 on this test set, which is very close to TurNuP’s performance with a PCC of 0.67 and an R² of 0.44. This suggests that CataPro also has the potential to handle multi-substrate reactions. Following the method of Kroll et al., we partitioned the test set into four subsets based on the maximal sequence identity to enzymes in the CataPro k_cat dataset. The maximum sequence identity intervals for these subsets were 0−40%, 40−80%, 80−99%, and 99−100%. The performance of CataPro in each subset is presented in detail in the Supplementary Fig. 9a. Notably, in the 0−40% identity subset, CataPro achieved a PCC of 0.410 and an R² of 0.133, showing a considerable gap compared to TurNuP’s R² of 0.33 in this subset. This suggests that CataPro may still face challenges when dealing with unfamiliar reactions involving multiple substrates, underscoring the importance of evaluating the model in a broader range of application scenarios to avoid overly optimistic results that may arise from relying solely on the 10-fold cross-validation dataset. Additionally, we retrained CataPro on the training set curated by Kroll et al., which contains 3,421 kcat data points, to further explore the potential of the CataPro architecture for tasks involving multi-substrate reactions. For reactions with multiple substrates, the final substrate features were represented by the average of the features of all substrates. 
We trained CataPro using 5-fold cross-validation, consistent with the folds used by Kroll et al. During inference, the final prediction was obtained by averaging the predictions from the five models. On the TurNuP test set, the retrained CataPro achieved a PCC of 0.672, an R² of 0.451, and an MSE of 0.783, which are highly consistent with TurNuP’s PCC of 0.67, R² of 0.44, and MSE of 0.81. In the four subsets divided based on the maximum sequence identity to enzyme sequences in the training set, the retrained CataPro also achieved performance similar to TurNuP, with an R² of 0.333 on the 0−40% subset (Supplementary Fig. 9b). This suggests that when the CataPro architecture is specifically trained on datasets involving multi-substrate reactions, it can achieve performance comparable to the current state-of-the-art (SOTA) models. This provides insights for the future development of enzyme kinetic parameter prediction models utilizing multi-substrate inputs.
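The three averaging steps described for multi-substrate reactions can be summarised in a short sketch. The function names are hypothetical, and whether averaging was done on raw or log-transformed values is not stated in the text; the code simply averages whatever values it is given.

```python
import numpy as np

def reaction_prediction_from_pairs(per_substrate_predictions):
    """Zero-shot use: predict k_cat per enzyme-substrate pair, then average."""
    return float(np.mean(per_substrate_predictions))

def averaged_substrate_features(substrate_features):
    """Retraining setup: average the feature vectors of all substrates."""
    return np.mean(np.asarray(substrate_features, dtype=float), axis=0)

def ensemble_prediction(fold_predictions):
    """Inference: average the outputs of the five cross-validation models."""
    return float(np.mean(fold_predictions))

# Toy values for a two-substrate reaction and a five-model ensemble.
pair_avg = reaction_prediction_from_pairs([1.0, 3.0])
mean_feat = averaged_substrate_features([[0.0, 2.0], [2.0, 4.0]])
final = ensemble_prediction([0.9, 1.1, 1.0, 1.2, 0.8])
```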

Transfer learning improves k_cat/K_m prediction

In this study, the k_cat/K_m dataset consists of enzyme-substrate pairs containing k_cat and K_m values, and the distribution of samples across various k_cat/K_m ranges is shown in Fig. 3a. These samples are stratified into ten folds based on a sequence similarity threshold of 0.4, with label distributions illustrated in Fig. 3b.

**Fig. 3: Performance comparison of k_cat/K_m models.**

Although k_cat/K_m can be directly calculated from the predicted k_cat and K_m, this may not be reasonable because errors in the k_cat and K_m values can potentially amplify errors in the k_cat/K_m values. A better approach is to directly use the model to learn k_cat/K_m values, as exemplified by UniKP. However, this neglects the relationship between the k_cat/K_m value and the individual k_cat and K_m values. To overcome these challenges, CataPro adopts an architecture that integrates pre-trained models for k_cat and K_m prediction with a neural network-based correction term to predict k_cat/K_m. On the unbiased datasets of k_cat/K_m, CataPro achieved a PCC of 0.413, SCC of 0.416, and RMSE of 1.619, significantly outperforming DLKcat and UniKP (Fig. 3c–f). The ProtT5_molT5-MACC shown in Fig. 3d–f demonstrates the performance when directly using the architecture shown in Fig. 1b to train k_cat/K_m. Comparison between CataPro and ProtT5_molT5-MACC demonstrates that the architecture implemented in CataPro results in an increase in PCC from 0.392 to 0.413 and a decrease in RMSE from 1.64 to 1.619. This showcases the effectiveness of the feature combination and architecture adopted by CataPro in predicting k_cat/K_m. The models were also tested on the randomly partitioned ten-fold cross-validation datasets of k_cat/K_m, confirming the same conclusions as observed in the previous k_cat and K_m tests (Supplementary Fig. 10).

CataPro enhances the ranking ability for mutations

In industrial production, optimizing enzymatic catalytic efficiency is crucial for reducing production costs. However, the inherent activity of natural enzymes often falls short of the requirements for high catalytic performance. Consequently, enhancing the activity of natural enzymes through modification has consistently been a primary objective within the enzyme industry. Despite the widespread acknowledgment of this challenge, recent research on predicting kinetic parameters has rarely provided a systematic examination of the models’ capacity to differentiate between mutants in specific catalytic reactions. This study introduces an evaluation metric aimed at assessing the efficacy of enzyme kinetic parameter prediction models in the context of enzyme engineering, underscoring the importance of accurately predicting mutation effects for successful enzyme optimization.

We first selected reactions from each dataset (k_cat, K_m, and k_cat/K_m datasets) where the number of mutants, including the wild-type, exceeded a certain minimum threshold N. The criterion for defining the same reaction was based on identical UniProtID-SMILES pairs for enzyme and substrate. During training on the ten-fold cross-validation datasets, predictions for the wild-type and mutants of selected reactions in the validation set were used for evaluation. Based on the distribution of the number of mutants within a single reaction in the current datasets, we evaluated the cases of N being 20 and 30, respectively, with the corresponding number of reactions shown in Supplementary Figs. 11a–c. Most mutants involved in reactions with N ≥ 20 and N ≥ 30 across the three datasets are disadvantageous, showing worse effects than the wild-type (Supplementary Figs. 11d−f). For each reaction, the SCC between predicted and experimental values for all mutants was calculated as a measure of the model’s ability to rank mutants within that reaction. The average SCC value achieved by the model across all reactions reflects its overall performance in ranking mutants. Figure 4a–c demonstrates that CataPro exhibits superior ranking ability compared to UniKP, particularly in the k_cat and k_cat/K_m datasets (Supplementary Table 1). We also tested the accuracy of the models in selecting the more favorable mutant between any two mutants (including the wild-type) within the same reaction. If a reaction contains N mutants (including the wild-type), there are a total of N(N-1)/2 comparisons. According to Fig. 4d–f and Supplementary Table 2, CataPro still shows a clear overall advantage over UniKP and DLKcat.
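The pairwise comparison test can be sketched as below: for N variants in a reaction there are N(N-1)/2 pairs, and a pair counts as correct when the predicted ordering matches the experimental one. The function name and the choice to skip experimentally tied pairs are assumptions made for this illustration.

```python
from itertools import combinations

def pairwise_accuracy(experimental, predicted):
    """Return (accuracy, number_of_pairs) for one reaction.

    A pair (i, j) is correct when predicted and experimental values
    order the two variants the same way.
    """
    pairs = list(combinations(range(len(experimental)), 2))
    correct = 0
    counted = 0
    for i, j in pairs:
        true_diff = experimental[i] - experimental[j]
        if true_diff == 0:
            continue  # skip experimental ties (an assumption)
        counted += 1
        pred_diff = predicted[i] - predicted[j]
        if true_diff * pred_diff > 0:
            correct += 1
    accuracy = correct / counted if counted else float("nan")
    return accuracy, len(pairs)

# Four variants -> 4 * 3 / 2 = 6 comparisons; one pair is mis-ordered.
acc, n_pairs = pairwise_accuracy([1.0, 2.0, 3.0, 4.0], [0.1, 0.3, 0.2, 0.9])
```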

**Fig. 4: Performance comparison of CataPro and UniKP in enzyme engineering scenarios.**

Additionally, we adopted the method proposed by Kroll et al. to evaluate the models’ ability to predict mutation effects⁶⁶. Specifically, for each enzyme-substrate pair with N ≥ 20 and N ≥ 30, the mean of the measured values (k_cat, K_m, or k_cat/K_m) for the wild-type and all mutants was first calculated. This mean was then subtracted from the experimental and predicted values of each mutant and the wild-type in this reaction to obtain the experimental mutation effect and the predicted mutation effect, respectively. Supplementary Figs. 12a−c illustrate that CataPro exhibits excellent correlation between predicted and experimental mutation effects for certain reactions, despite the training set containing no enzymes with more than 40% similarity to the enzyme. Conversely, for some reactions, CataPro demonstrates weaker correlations (Supplementary Figs. 12d−f). Notably, even for reactions with high correlation, CataPro demonstrates limitations in accurately predicting the absolute values of mutation effects. To perform a comprehensive evaluation of the model’s ability to predict mutation effects, we concatenated the predicted and experimental mutation effects across all reactions. Unfortunately, in this test, none of CataPro, UniKP, or DLKcat successfully predicted the mutation effects (Supplementary Table 3). This may indicate that precisely capturing the effects of mutations remains a considerable challenge.
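The mean-centring step can be sketched as follows. This reads the description literally, so both experimental and predicted values are centred on the mean of the measured values for that reaction; if the original procedure centred predictions on their own mean instead, the second returned array would differ. Function names are hypothetical.

```python
import numpy as np

def mutation_effects(experimental, predicted):
    """Centre both value sets on the mean of the measured values."""
    experimental = np.asarray(experimental, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    reference = experimental.mean()  # mean over wild-type and all mutants
    return experimental - reference, predicted - reference

def pooled_effects(reactions):
    """Concatenate effects over reactions given as (experimental, predicted)."""
    exp_all, pred_all = [], []
    for experimental, predicted in reactions:
        exp_eff, pred_eff = mutation_effects(experimental, predicted)
        exp_all.append(exp_eff)
        pred_all.append(pred_eff)
    return np.concatenate(exp_all), np.concatenate(pred_all)

exp_eff, pred_eff = mutation_effects([1.0, 2.0, 3.0], [2.0, 2.0, 5.0])
pooled_exp, pooled_pred = pooled_effects(
    [([1.0, 2.0, 3.0], [2.0, 2.0, 5.0]), ([0.0, 4.0], [1.0, 1.0])]
)
```

Under this reading, a model that ranks mutants well within a reaction but is offset in absolute value still produces poorly matching pooled effects, which is consistent with the gap between per-reaction and concatenated results reported here.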

Performance of CataPro on external test datasets from diverse sources

To further evaluate the potential of our model in enzyme mining and engineering, we collected four experimentally measured datasets from previously reported studies as additional external test sets. The first two test sets are the Tyrosine ammonia-lyase (TAL) homologue dataset and the TAL engineering dataset, collected from UniKP⁴³, both containing experimentally measured k_cat/K_m values. TAL enzymes are utilized in the production of aromatic compounds, such as flavonoids, cinnamoyl anthranilates, or plastic precursors⁶⁷, and identifying their high-activity alternative enzymes and mutants has garnered considerable interest from the community. In a recent study, MPEK also used these two datasets for testing⁴⁴. In the TAL homologue dataset, IsTAL has the lowest experimental k_cat/K_m value. Taking IsTAL as a reference, if the model predicts higher k_cat/K_m values for other enzymes than that of IsTAL, the prediction is considered correct (marked in green in Fig. 5a); otherwise, it is considered incorrect (marked in orange in Fig. 5a). It should be noted that in these two test datasets, except for a five-site mutant of RgTAL present in our k_cat/K_m dataset, the maximum sequence similarities of all other test sequences to those in the k_cat/K_m dataset were less than 0.67. The results in Fig. 5a indicate that CataPro has a 50% success rate in identifying TAL enzymes with higher activity than IsTAL. The TAL engineering dataset includes the wild-type RgTAL and its nine mutants. Among them, MT-587V, MT-10Y, and MT-489T exhibit enhanced catalytic efficiency compared to the wild-type, whereas the remaining five mutants demonstrate diminished catalytic efficiency. The results depicted in Fig. 5b illustrate that successful predictions were made for eight out of nine mutants, using the wild-type as a reference.

**Fig. 5: Performance of CataPro on the small datasets measured in previously reported experiments.**

The other two small datasets involve the activity of D-2-deoxyribose-5-phosphate aldolase (DERA) from Escherichia coli and BH1352 from Bacillus halodurans in catalyzing the D-2-deoxyribose-5-phosphate (DRP) reaction^68,69. Catalyzing the DRP reaction is a crucial biocatalytic step in the conversion of renewable raw materials into valuable chemicals such as non-natural diol-1,3-BDO, which is used in the synthesis of polymers, pheromones, fragrances, insecticides, and antibiotics. In the DERA dataset, all mutants exhibited lower activity than the wild-type. Interestingly, CataPro was able to identify the decrease in enzyme activity resulting from these mutations (Fig. 5c), despite the presence of five reaction data points for DERA catalyzing other substrates in the k_cat/K_m dataset. The BH1352 dataset consists of nine single-point mutants and one double-point mutant, with the BH1352 enzyme exhibiting a maximum sequence identity of only 0.54 to the enzymes in our k_cat/K_m dataset. With the BH1352 dataset, CataPro achieved an 80% prediction success rate (Fig. 5d). These results indicate that CataPro exhibits robust performance in datasets from diverse sources, once again confirming its strong potential for practical applications.

Performance of CataPro on deep mutation scanning datasets

Traditional experimental methods usually assess only dozens of mutations at a time, which is trivial compared to the entire mutational landscape of the protein. Deep mutational scanning (DMS), an approach that integrates genotype and phenotype, enables the evaluation of fitness in up to one million protein variants in a single experiment⁵⁸. While enzyme kinetic parameters are not fully correlated with fitness, which is influenced by diverse factors such as stability, folding efficiency, catalytic activity, binding specificity, and affinity⁷⁰, evaluating enzyme kinetic parameter models on the genotype-phenotype datasets remains meaningful.

To evaluate the performance of the predictors on mutants across a broader range of sites, we collected DMS data for four enzymes (with their corresponding substrates) for testing. The first dataset is the EcTL dataset, which includes the TEM-1 β-lactamase expressed in Escherichia coli (Fig. 6a). This enzyme hydrolyzes β-lactam antibiotics, such as ampicillin, thus conferring antibiotic resistance to bacteria⁷¹. This dataset contains fitness data for 5,468 mutants across 286 sites. The other three datasets involve SsIGPS from Sulfolobus solfataricus, TmIGPS from Thermotoga maritima, and TtIGPS from Thermus thermophilus, which are indole-3-glycerol phosphate synthases (IGPS) from diverse organisms⁷² (Fig. 6b–d). Despite their similar functions, these IGPS enzymes exhibit low sequence similarity. Each of these IGPS enzymes has fitness data for approximately 1,513 mutants involving 80 sites.

**Fig. 6: Performance of CataPro and baseline models on the DMS datasets.**

We evaluated the performance of CataPro, UniKP, and DLKcat on these four DMS datasets, as these datasets encompass mutations at various sites either proximal or distal to the catalytic sites. This extensive mutation space better reveals the real performance of the models. In addition, the Position-Specific Scoring Matrix (PSSM) was used as an additional mutation-function prediction baseline. PSSM incorporates protein sequence co-evolution information, with each element representing the frequency of a specific amino acid occurring at a particular position across different homologous sequences. The conservation and mutation information encapsulated in the PSSM are frequently utilized as descriptors for protein engineering⁷³. Interestingly, the SCCs achieved by the k_cat and k_cat/K_m models of CataPro either approach or surpass those obtained using PSSM scores (Fig. 6e–h). For UniKP and DLKcat, the performance of the k_cat model of UniKP is relatively better, particularly on the EcTL dataset, where its performance approaches that of PSSM. To ensure the reliability of CataPro’s results on these four DMS datasets, we calculated the maximum sequence identity between the enzyme sequences in these four test sets and those in our k_cat, K_m, and k_cat/K_m datasets. With the exception of TtIGPS, which has a maximum sequence identity of 0.48, the other three enzymes are included in our dataset. Since our ten-fold cross-validation datasets are clustered based on a protein sequence similarity threshold of 0.4, each enzyme and its similar enzyme sequences are only present in the same subset. Supplementary Table 4 presents the performance of all CataPro sub-models for kcat, Km, and kcat/Km on these four DMS datasets. In each DMS dataset, the performance of models trained on subsets where the test enzyme is not included is highlighted in bold in Supplementary Table 4. 
Interestingly, models that are unfamiliar with the test enzyme are not necessarily worse than those models where the test enzyme is included in the training set. Notably, in the EcTL dataset, the model that had not encountered the EcTL enzyme achieved a SCC of 0.437, ranking second among all the sub-models. These results highlight the robustness of CataPro for protein fitness landscape prediction.

CataPro-assisted enzyme mining and directed evolution

In previous assessments, CataPro exhibited notable superiority over the other models. To determine how it performs in real-world applications, we applied CataPro to discovering enzymes and identifying high activity mutations for vanillin biosynthesis.

Vanillin is an important aromatic aldehyde with a rich milky and vanilla aroma, widely used in industries such as food, beverages, cosmetics, and pharmaceuticals^74,75,76,77. Although vanillin is found in many plants, such as vanilla bean pods, its natural production is very low because of the stringent cultivation requirements of these plants, resulting in naturally derived vanillin accounting for less than 1% of the vanillin sold on the market. Over 99% of the vanillin on the market is chemically synthesized using eugenol and guaiacol as substrates⁷⁷. This method is cost-effective but involves high energy consumption and substantial pollution. In contrast, producing vanillin through biotransformation processes using agricultural residues, lignin, and ferulic acid as substrates complies with food safety requirements set by the United States and the European Union, which stipulate that vanillin synthesized from ferulic acid can be considered natural-equivalent vanillin. A synthetic pathway with strong industrial application potential involves the decarboxylation of ferulic acid by ferulic acid decarboxylase (FDC) to produce 4-VG, followed by oxidation by Caulobacter segnis carotenoid cleavage oxygenases (CSO2)^75,78. Here, we demonstrate the high potential of CataPro to assist in the discovery of highly active oxygenases and mutants for vanillin biosynthesis.

To discover more potent 4-VG oxidation enzymes, we employed BLAST to retrieve 1500 sequences with sequence similarity higher than 0.2 to the CSO2 sequence from the UniProt database⁶⁰. Subsequently, sequences with a length difference higher than 20 amino acids compared to the CSO2 sequence were removed. The remaining sequences were screened based on the K_m and k_cat values predicted by CataPro, and the top 150 enzymes were retained. As protein function depends on its structure, and structural-based clustering is a useful strategy for enzyme discovery¹⁹, we calculated the structural similarity between these enzymes and CSO2 using TM-Align⁷⁹. The top half of the enzymes were clustered through t-SNE to produce the final five representative enzymes (Fig. 7a). These five enzymes, PpCSO, MgpCSO, PgCSO, SsCSO, and TkCSO, are derived from Pseudomonas putida, marine gamma proteobacterium, Pseudomonas gingeri, Sphingobium sp., and Trebonia kvetii, respectively. Among them, only PpCSO is included in the CataPro dataset, while the maximum sequence identity between the other four enzymes and those in our dataset is below 44%. We measured the activity of these five enzymes through experiments and found that the activity of SsCSO was 19.53 times higher than the reference enzyme CSO2 (Fig. 7c).

**Fig. 7: Results of CataPro in the mining and engineering of Carotenoid Cleavage Oxygenases.**

To validate the efficacy of CataPro in enzyme engineering, we utilized CataPro for computationally driven directed evolution of SsCSO. We applied CataPro and PSSM scores to select mutants, ensuring they simultaneously exhibit high activity and evolutionary conservation^80,81,82. In practical enzyme engineering, maintaining or enhancing the structural stability of enzymes is equally important^83,84. To ensure that mutations do not compromise enzyme stability, PSSM serves as a crucial criterion adopted in our approach to restrict the potential protein structure change and reduce the vast mutation sampling space. This strategy has been successfully applied in many enzyme activity and selectivity improvement scenarios^81,82,85. Residues situated within the binding pocket and directly interacting with the substrate are often targets for mutation, as they directly affect the enzyme function and activity^83,86. Here, we first employed AlphaFold2²² to predict the SsCSO structure, followed by molecular docking⁸⁷ to simulate the enzyme-substrate complex structure. Ninety-one residues within a distance of 12 Å from the substrate (excluding histidines in contact with the iron ion) were selected as mutation sites. Each site was mutated to the other 19 standard amino acids, resulting in a total of 1729 (19 × 91 = 1729) single-point mutants. The top half of the mutants, ranked by the predicted k_cat/K_m values from CataPro, were retained. Subsequently, we assessed the evolutionary conservation of these mutants using PSSM. If the amino acid at site n mutates from i to j, the change in PSSM score caused by this mutation is defined as \(\Delta \mathrm{PSSM}_{ij}^{n} = \mathrm{PSSM}_{j}^{n} - \mathrm{PSSM}_{i}^{n}\). A larger \(\Delta \mathrm{PSSM}_{ij}^{n}\) indicates that the mutation is more consistent with evolutionary trends. 
Six mutants with \(\Delta \mathrm{PSSM}_{ij}^{n}\) greater than 7 were selected for experimental validation (Fig. 7b). Among these mutants, T216M and M351F exhibit higher activity compared to the wild-type (Fig. 7c).
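The two-stage filter described above (keep the top half by predicted k_cat/K_m, then require a ΔPSSM above 7) can be sketched in a few lines. The dictionary layout for the PSSM, the function names, and the handling of an odd-sized list when taking the "top half" are assumptions; the toy sequence and scores are made up purely to exercise the code.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def enumerate_single_mutants(sequence, sites):
    """List (site, wild_type, mutant) for all 19 substitutions at 1-based sites."""
    mutants = []
    for site in sites:
        wt = sequence[site - 1]
        mutants.extend((site, wt, aa) for aa in AMINO_ACIDS if aa != wt)
    return mutants

def delta_pssm(pssm, site, wt, mut):
    """Delta PSSM = PSSM score of the mutant residue minus that of the wild type."""
    return pssm[site][mut] - pssm[site][wt]

def select_candidates(mutants, predicted, pssm, threshold=7):
    """Keep the top half by predicted k_cat/K_m, then filter on delta PSSM."""
    ranked = sorted(mutants, key=lambda m: predicted[m], reverse=True)
    top_half = ranked[: len(ranked) // 2]
    return [m for m in top_half if delta_pssm(pssm, *m) > threshold]

# Toy data: a two-residue "enzyme" with a flat PSSM except for T1M.
toy_mutants = enumerate_single_mutants("TM", [1, 2])
toy_pssm = {site: {aa: 0 for aa in AMINO_ACIDS} for site in (1, 2)}
toy_pssm[1]["M"] = 8
toy_predicted = {m: 0.0 for m in toy_mutants}
toy_predicted[(1, "T", "M")] = 1.0
selected = select_candidates(toy_mutants, toy_predicted, toy_pssm)
```

With 91 sites the enumeration yields 19 × 91 = 1729 single-point mutants, matching the count given in the text.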

To further enhance the activity of SsCSO, we conducted the next round of mutations based on the combined mutant T216M-M351F. We selected 22 residues in the loop region of the SsCSO pocket (excluding residues duplicated in the first round of mutations) as new mutation sites, resulting in a total of 418 (19 × 22 = 418) mutants. By combining the predicted k_cat/K_m values from CataPro and the PSSM scores, six mutants with T216M-M351F as the template were selected for experimental validation (Fig. 7b). The results demonstrated that the mutants T216M-M351F-Q100G and T216M-M351F-V384G exhibited significantly higher activity, being 3.16-fold and 3.34-fold higher than that of the wild-type SsCSO, respectively. When compared to CSO2, the activity improvement is 61.71-fold and 65.23-fold for SsCSO T216M-M351F-Q100G and T216M-M351F-V384G, respectively (Fig. 7c). The relative activity values of all candidate enzymes discovered during enzyme mining and SsCSO mutants generated through enzyme engineering are presented in Supplementary Table 5. Figure 7d illustrates the locations of the final dominant mutation sites. The above experiments demonstrated that CataPro is an effective method for enzyme discovery and engineering.
