Enzyme Optimization - Training
The Enzyme Optimization Platform lets users fine-tune and customize models on their own uploaded data. Users provide the enzyme sequences to be optimized together with experimental results for known variants. The platform analyzes this experimental data and uses it to update and retrain the model. After fine-tuning, the model predicts more precisely how mutations affect enzymatic characteristics such as Enzyme Activity, pH Stability, and Thermostability, which makes enzyme modification and optimization more effective.
Features
- Advanced Algorithms: GeoEnzyme is built on a proprietary pretrained large model, which allows efficient fine-tuning and improves overall model performance.
- One-Click Operation: Users only need to upload suitable experimental data. GeoEnzyme automatically selects the optimization targets and training parameters and starts the training process. No prior experience in AI model training is required to go from model training to inference.
- Customized Services: Starting from a general large model, GeoEnzyme is fine-tuned on data for specific enzymes, producing tailored models for distinct enzyme categories. These customized models outperform the general model on the targeted objectives.
Inputs
To submit an Enzyme Optimization-Training job, please open the Project Editor and select "Enzyme Optimization-Training" from the "Protein Design" dropdown menu.
- Dataset file: A compilation of existing experimental results that serves as training data for the model. It must be uploaded in CSV format. The data may include the following columns:
    - Sequence column: Contains the sequences of the wildtype and its variants. All sequences must derive from the same wildtype, and the experimental data for the wildtype must be listed before all variants. All sequences should have the same length.
    - Activity column: Contains the measured enzyme activity. It can be expressed as a kinetic constant such as kcat or kcat/Km, as a conversion rate, or in standard enzyme activity units. All values must use the same unit so that they are comparable. Blank rows are ignored automatically during training.
    - pH column: The pH of the experimental environment, from 0 to 14. Leave this field blank if the pH is unknown. To improve the accuracy of pH stability predictions, provide multiple activity data points for variants under each pH condition.
    - Temp.(℃) column: The temperature of the experimental environment, from 0 to 100. Leave this field blank if the temperature is unknown. To improve the accuracy of thermostability predictions, provide multiple activity data points for variants under each temperature condition.
    - %e.e. column: The enantiomeric excess (e.e.) of the reaction products, as a percentage.
        - When there is only one chiral center, the e.e. value is the percentage excess of the target product over the by-product. For example, if the target product is S, the by-product is R, and the S:R ratio is 90:10, then e.e. = (S-R)/(S+R) × 100 = 80.
        - When multiple chiral centers are present, give the percentage excess of the target product over the main by-product. For instance, if the target product is SS, the by-products are SR, RS, and RR, and the SS:SR:RS:RR ratio is 90:10:1:5, then e.e. = (SS-SR)/(SS+SR) × 100 = 80. Use the same calculation method for the wildtype and all variants.
    - Solubility column: Contains the measured enzyme solubility. Values must be greater than 0, with higher values indicating better solubility. Make sure the relative magnitude of all data points is accurate.
- Reactants: The reactants in the enzyme-catalyzed reaction, entered as SMILES expressions. Enter one SMILES expression per line. Chiral molecules are supported. Please refer to Using Ketcher for details.
- Products: The products in the enzyme-catalyzed reaction, in the same input format as Reactants.
- By-reactants: When uploading data that contains e.e. values, this field gives the by-reactants in the enzyme-catalyzed reaction, in the same format as Reactants. The by-reactants should be identical to the reactants except for chirality. If the reactants are achiral or chirality does not affect the reaction, the by-reactants should match the reactants completely.
- By-Products: When uploading data that contains e.e. values, this field gives the by-products in the enzyme-catalyzed reaction, in the same format as Reactants. The by-products should differ from the products only in chiral configuration. If the product has multiple chiral centers, specify the chiral isomers corresponding to the products catalyzed by the wildtype enzyme and to the main by-products.
- Job Name: The name of the job. The job name must be unique within the project.
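The dataset rules above (equal sequence lengths, pH from 0 to 14, temperature from 0 to 100, solubility greater than 0, and the e.e. formula) can be checked locally before uploading. The following is a minimal sketch, not part of the platform: the CSV header names (`Sequence`, `pH`, `Temp.(℃)`, `Solubility`) are assumed to match the column names listed above, and `check_dataset`, `enantiomeric_excess`, and the example sequences are hypothetical.

```python
import csv
import io


def enantiomeric_excess(target, main_byproduct):
    """Return %e.e. = (target - byproduct) / (target + byproduct) * 100."""
    total = target + main_byproduct
    if total <= 0:
        raise ValueError("amounts must sum to a positive value")
    return 100.0 * (target - main_byproduct) / total


def check_dataset(csv_text):
    """Return a list of problems found in a dataset CSV (empty if none).

    Header names are assumptions based on the column names in this page.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return ["dataset has no data rows"]
    problems = []
    # The first data row is assumed to be the wildtype.
    expected_len = len(rows[0]["Sequence"])
    for line_no, row in enumerate(rows, start=2):  # line 1 is the header
        if len(row["Sequence"]) != expected_len:
            problems.append(f"line {line_no}: sequence length differs from the wildtype")
        # Blank pH / temperature cells are allowed and skipped.
        for column, low, high in (("pH", 0.0, 14.0), ("Temp.(℃)", 0.0, 100.0)):
            value = (row.get(column) or "").strip()
            if value and not low <= float(value) <= high:
                problems.append(f"line {line_no}: {column} outside {low:g}-{high:g}")
        solubility = (row.get("Solubility") or "").strip()
        if solubility and float(solubility) <= 0:
            problems.append(f"line {line_no}: Solubility must be greater than 0")
    return problems


# Made-up example data: the last row has a short sequence and an invalid pH.
example = """Sequence,Activity,pH,Temp.(℃),%e.e.
MKTAYIAK,1.00,7.0,30,80
MKTAYIAR,1.35,7.0,30,92
MKTAYIA,0.80,15,30,75
"""

for problem in check_dataset(example):
    print(problem)
print(enantiomeric_excess(90, 10))
```

The check cannot tell whether the first row really is the wildtype, so that ordering still has to be confirmed by hand.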
Models & Parameters
Click the Show Parameters button to expand the model and parameter settings.
You can utilize our proprietary GeoEnzyme model as the base model to run this job. The parameters are as follows:
- #epochs: The number of training iterations (epochs) in the fine-tuning process, with a maximum value of 20.
Results
Click Job Results in the Files & Jobs panel to view the job results.
Trained models
Once training is complete, you can use the new model for inference by clicking the button at the top right of the table.
The table provides an overview of the Trained models, which includes the following columns:
- Task: The task for which the model was trained. The type of training task is determined automatically from the contents of the uploaded dataset.
- Best PearsonR: One of the evaluation metrics used during model training. PearsonR measures the linear correlation between the predicted values and the ground truth, with values closer to 1 indicating better predictive performance.
- Best SpearmanR: Another evaluation metric used during model training. SpearmanR measures the correlation between the ranks of the predicted values and the ranks of the ground truth, with values closer to 1 indicating better predictive performance.
Evaluation Metrics
During model training, GeoEnzyme prioritizes SpearmanR over PearsonR. SpearmanR measures the correlation of rankings, which matches GeoEnzyme's objective of ranking mutations accurately across properties rather than predicting exact values.
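The difference between the two metrics can be seen on a small example. The sketch below uses only the Python standard library and made-up numbers; `pearson_r`, `average_ranks`, and `spearman_r` are illustrative helpers, not GeoEnzyme functions. The predictions put every variant in the correct order but on the wrong scale, so the rank correlation is perfect while the linear correlation is not.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_r(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# Hypothetical measured activities and model predictions.
measured = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [0.1, 0.2, 0.4, 1.5, 9.0]  # correct order, wrong scale

print(f"PearsonR  = {pearson_r(measured, predicted):.3f}")
print(f"SpearmanR = {spearman_r(measured, predicted):.3f}")
```

A model like this one would still be useful for choosing which mutations to test first, which is the ranking objective described above.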
Training curve
The job results include a table for each task that shows its training curve, with the following columns:
- Epoch: The training epoch (iteration) that the row reports on.
- PearsonR: The PearsonR of the model on the validation set at that epoch.
- SpearmanR: The SpearmanR of the model on the validation set at that epoch.
- Selected: Marks the final model chosen after training. A "Y" indicates the best-performing model.
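One way to read the Selected column is as the result of picking the best epoch from the validation metrics. The sketch below assumes a simple rule, highest SpearmanR with PearsonR as a tie-breaker, which is consistent with the stated priority of SpearmanR but is not confirmed as the rule GeoEnzyme applies. `EpochMetrics`, `select_best_epoch`, and the metric values are hypothetical.

```python
from typing import NamedTuple


class EpochMetrics(NamedTuple):
    epoch: int
    pearson_r: float
    spearman_r: float


def select_best_epoch(curve):
    """Pick the row with the highest SpearmanR, using PearsonR to break ties.

    This selection rule is an assumption made for illustration.
    """
    return max(curve, key=lambda row: (row.spearman_r, row.pearson_r))


# Made-up validation metrics for a four-epoch run.
curve = [
    EpochMetrics(1, 0.41, 0.38),
    EpochMetrics(2, 0.55, 0.52),
    EpochMetrics(3, 0.61, 0.52),
    EpochMetrics(4, 0.58, 0.49),
]

best = select_best_epoch(curve)
print("Epoch  PearsonR  SpearmanR  Selected")
for row in curve:
    marker = "Y" if row is best else ""
    print(f"{row.epoch:>5}  {row.pearson_r:>8.2f}  {row.spearman_r:>9.2f}  {marker}")
```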