CHEESE CLI
Once CHEESE is installed, on-prem users have access to a CLI tool. To check that the installation works, run the command cheese, which displays the available commands.
Usage: -c [OPTIONS] COMMAND [ARGS]...
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --help Show this message and exit. │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ embeddings Run CHEESE embeddings CPU computation on an input file. │
│ embeddings_gpu Run CHEESE embeddings GPU computation for an input file. │
│ generate_license_key Generate a license key for CHEESE │
│ inference Run CHEESE Inference for an input file.' │
│ search Run CHEESE Search on a file of your choice, and save the search outputs to an output file. │
│ start_app Start the CHEESE APP │
│ update Get the latest CHEESE version │
│ visualize Visualize molecules in 2D from an input file. │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Updating CHEESE
To get the latest CHEESE version you can run the command cheese update
CHEESE license file
- Run the command cheese generate_license_key to generate a license key. Please note that the license key is environment-specific, i.e., you will need another license file if you want to run CHEESE on another host machine.
- Copy the license key and send it to us.
- We will send you a JSON license file. Place it at the path defined in the LICENSE_FILE environment variable during installation.
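As a quick sanity check before starting CHEESE, the sketch below verifies that the LICENSE_FILE environment variable points to a readable JSON file. The helper name check_license_file is hypothetical and is not part of CHEESE. It only confirms that the file exists and parses as JSON; it does not validate the license itself.

```python
import json
import os


def check_license_file(env_var="LICENSE_FILE"):
    """Return True if env_var points to an existing file that parses as JSON.

    Hypothetical helper: it does not verify that the license is valid,
    only that a JSON file sits at the configured path.
    """
    path = os.environ.get(env_var)
    if not path or not os.path.isfile(path):
        return False
    try:
        with open(path, "r", encoding="utf-8") as handle:
            json.load(handle)
    except (OSError, ValueError):
        # ValueError covers malformed JSON and undecodable bytes.
        return False
    return True


if __name__ == "__main__":
    print("License file found:", check_license_file())
```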
Input and Output File Format
For commands that require an input file, the file should contain one molecule per line, as a SMILES string followed by its ID, in the format smiles,id. Here is an example of an input CSV file:
smiles,id
C[C@H](NC(=O)N1CC2(CCC2)C1c1ccc(F)cc1)C1CC1,Z5348285396
CC(NC(=O)N1CC2(CCC2)C1c1ccc(F)cc1)C1CC1,Z5348285396
C[C@@H](NC(=O)N1CC2(CCC2)C1c1ccc(F)cc1)C1CC1,Z5348285396
The .smi, .sdf and .txt file formats are also supported.
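If your molecules live in Python, an input file in the smiles,id layout can be produced with the standard csv module. This is a minimal sketch: the function name write_cheese_input, the file name mydb.csv and the molecule IDs are made up for illustration.

```python
import csv


def write_cheese_input(rows, path):
    """Write (smiles, id) pairs to a CSV file with a 'smiles,id' header."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        # Plain "\n" line endings instead of the csv default "\r\n".
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(["smiles", "id"])
        for smiles, mol_id in rows:
            writer.writerow([smiles, mol_id])


if __name__ == "__main__":
    molecules = [
        ("CCO", "MOL-1"),        # hypothetical IDs, for illustration only
        ("c1ccccc1", "MOL-2"),
    ]
    write_cheese_input(molecules, "mydb.csv")
```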
Output files (CHEESE embeddings) are saved in .npy or .parquet format. .npy is the binary array format of the Python library NumPy, and .parquet is the columnar storage format of the Apache Parquet ecosystem. Our API can also return JSON values or a CSV file; for larger numbers of molecules, however, we strongly recommend the .npy or .parquet formats.
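To show what working with a .npy file looks like, the sketch below saves a small stand-in array and reads it back with NumPy. The file name and the (3, 8) shape are invented for the example; the actual file names and embedding dimensions produced by CHEESE are not specified here.

```python
import os
import tempfile

import numpy as np

# Stand-in for a CHEESE output file: 3 molecules x 8 dimensions (made-up shape).
fake_embeddings = np.random.default_rng(0).random((3, 8)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "embeddings_example.npy")  # hypothetical file name
    np.save(path, fake_embeddings)

    # Any .npy file is read back the same way.
    embeddings = np.load(path)

print(embeddings.shape, embeddings.dtype)
```

A .parquet file can be read in a similar spirit with a Parquet-capable library such as pandas or pyarrow.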
SMILES Standardization
Our tools expect input SMILES in a canonicalized, RDKit-compatible format, neutralized where possible. Inference has an optional canonicalization step that can be enabled with the --canonicalize_smiles flag. We recommend a standardization function like the one below, which was used during CHEESE model training. In casual applications the standardization step can be skipped, but it is always better to have the input in a consistent, standardized format.
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def standardize(smiles):
    """
    follows the steps in https://github.com/greglandrum/RSC_OpenScience_Standardization_202104/blob/main/MolStandardize%20pieces.ipynb
    as described **excellently** (by Greg) in https://www.youtube.com/watch?v=eWTApNX8dJQ
    Source: https://bitsilla.com/blog/2021/06/standardizing-a-molecule-using-rdkit/
    """
    mol = Chem.MolFromSmiles(smiles)
    # MolFromSmiles returns None when the SMILES cannot be parsed
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    # removeHs, disconnect metal atoms, normalize the molecule, reionize the molecule
    clean_mol = rdMolStandardize.Cleanup(mol)
    # if many fragments, get the "parent" (the actual mol we are interested in)
    parent_clean_mol = rdMolStandardize.FragmentParent(clean_mol)
    # try to neutralize molecule
    uncharger = rdMolStandardize.Uncharger()  # annoying, but necessary as no convenience method exists
    uncharged_parent_clean_mol = uncharger.uncharge(parent_clean_mol)
    # pick the canonical tautomer
    te = rdMolStandardize.TautomerEnumerator()  # idem
    taut_uncharged_parent_clean_mol = te.Canonicalize(uncharged_parent_clean_mol)
    return Chem.MolToSmiles(taut_uncharged_parent_clean_mol)
Index Inference
Required for Indexing Custom Database
Users of CHEESE Search who wish to index their own database and search it in the UI, API or CLI must first run an inference step. This step generates the embeddings and search indexes for the molecules in your database.
The CLI tool supports running CHEESE inference on your custom database with the command cheese inference. You can check the available options by running cheese inference --help
Usage: -c inference [OPTIONS]
Run CHEESE Inference for an input file.'
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * --input_file TEXT The input file in CSV format. Please provide a CSV │
│ file in the following format : 'SMILES,ID │
│ [default: None] │
│ [required] │
│ --dest TEXT Destination folder where to save the results. Will be │
│ inside your source folder │
│ [default: output] │
│ --index_type TEXT Index type : clustered, in_memory, auto │
│ [default: auto] │
│ --gpu_devices TEXT List of GPU devices on which to run computation : e.g │
│ '0,3,2' │
│ [default: 0] │
│ --validate_smiles --no-validate_smiles Whether to validate the SMILES of the input file │
│ [default: no-validate_smiles] │
│ --canonicalize_smiles --no-canonicalize_smiles Whether to canonicalize the SMILES of the input file │
│ [default: no-canonicalize_smiles] │
│ --help Show this message and exit. │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Please note that the index type is chosen automatically by default: if the input file exceeds 1 GB in size, the script runs the clustered inference; otherwise it runs the in_memory inference.
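The automatic choice described above can be mirrored in a few lines of Python, for example to predict which index type a file will get before launching a long run. This is only a sketch of the documented rule, not CHEESE's actual implementation: the function names are hypothetical, and reading "1GB" as 1024**3 bytes is an assumption.

```python
import os

# Assumption: the documentation does not say whether 1GB means 10**9 or 2**30 bytes.
ONE_GB = 1024 ** 3


def resolve_index_type(size_bytes, index_type="auto"):
    """Mirror the documented rule: 'auto' picks clustered above 1 GB, else in_memory."""
    if index_type != "auto":
        return index_type  # an explicit choice is kept as-is
    return "clustered" if size_bytes > ONE_GB else "in_memory"


def resolve_index_type_for_file(path, index_type="auto"):
    """Apply the same rule to a file on disk, using its size in bytes."""
    return resolve_index_type(os.path.getsize(path), index_type)


if __name__ == "__main__":
    print(resolve_index_type(5 * ONE_GB))
    print(resolve_index_type(200 * 1024 ** 2))
```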
Example
cheese inference --input_file '/data/mydb.csv' --dest /path/to/my_output --index_type in_memory
Embeddings Computation
CHEESE CLI supports large-scale embedding computation on CPU or GPU using CHEESE models, through the commands cheese embeddings and cheese embeddings_gpu. You can supply an input file of molecules, a destination folder for the embeddings, and the search type. You can check the available options with the --help flag:
cheese embeddings_gpu --help
Usage: -c embeddings_gpu [OPTIONS]
Run CHEESE embeddings GPU computation for an input file.
╭─ Options ────────────────────────────────────────────────────────────────────╮
│ * --input_file TEXT The input file in CSV format. Please provide a │
│ CSV file in the following format : 'SMILES,ID │
│ [default: None] │
│ [required] │
│ --search_type TEXT Type of embeddings : morgan, espsim_shape, │
│ espsim_electrostatic, active_pairs, all │
│ [default: all] │
│ --gpu_devices TEXT List of GPU devices on which to run │
│ computation : e.g '0,3,2' │
│ [default: 0] │
│ --save_format TEXT Save format of the embeddings. Can be │
│ 'parquet' or 'numpy' │
│ [default: numpy] │
│ --dest TEXT Destination folder. Will be inside your source │
│ folder. │
│ [default: computed_embeddings] │
│ --help Show this message and exit. │
╰──────────────────────────────────────────────────────────────────────────────╯
Example
cheese embeddings_gpu --input_file '/data/mydb.smi' --dest /data/my_embeddings --search_type active_pairs
[Figure: Multi-GPU inference speed]
Search
CHEESE CLI supports searching your available databases with the command cheese search. You can supply an input file of molecules and an output CSV file for the search results, together with other search parameters. You can check the available options by running cheese search --help
Usage: -c search [OPTIONS]
Run CHEESE Search on a file of your choice, and save the search outputs to an output file.
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * --input_file TEXT The input file in one of the following formats : .csv, .sdf, .smi or .txt │
│ [default: None] │
│ [required] │
│ * --output_file TEXT The output file in CSV format [default: None] [required] │
│ --db_names TEXT Databases to search in separated by ','. e.g 'ENAMINE-REAL,ZINC15' │
│ [default: ENAMINE-REAL] │
│ --search_type TEXT Search type. Can be : 'morgan', 'espsim_shape','espsim_electrostatic', 'active_pairs' │
│ [default: morgan] │
│ --search_quality TEXT Search quality. Can be : 'fast', 'accurate','very accurate' [default: fast] │
│ --n_neighbors INTEGER Number of results to retrieve. [default: 30] │
│ --help Show this message and exit. │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Example
cheese search --input_file '/data/myqueries.smi' --output_file '/data/results.csv' --db_names 'ZINC15,CUSTOM_DB' --search_type morgan --search_quality accurate --n_neighbors 100
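When searches are launched from a script, it can help to assemble the command as an argument list rather than a single string. The sketch below builds the same command as the example above, using only the options listed in the help output. The helper name build_search_command is hypothetical, and the actual call to the CLI is left commented out because it requires a working CHEESE installation.

```python
import shlex
import subprocess


def build_search_command(input_file, output_file, db_names=("ENAMINE-REAL",),
                         search_type="morgan", search_quality="fast",
                         n_neighbors=30):
    """Build the argument list for `cheese search` from the documented options."""
    return [
        "cheese", "search",
        "--input_file", input_file,
        "--output_file", output_file,
        "--db_names", ",".join(db_names),  # databases are comma-separated
        "--search_type", search_type,
        "--search_quality", search_quality,
        "--n_neighbors", str(n_neighbors),
    ]


cmd = build_search_command(
    "/data/myqueries.smi",
    "/data/results.csv",
    db_names=("ZINC15", "CUSTOM_DB"),
    search_quality="accurate",
    n_neighbors=100,
)
print(shlex.join(cmd))  # shows the command as it would be typed in a shell

# Uncomment to run the search (requires CHEESE to be installed):
# subprocess.run(cmd, check=True)
```

Passing a list to subprocess.run avoids shell quoting issues with paths and with values that contain spaces, such as the 'very accurate' search quality.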
CHEESE Visualization
CHEESE CLI supports visualizing molecules in 2D with the command cheese visualize. You can supply an input file of molecules and a destination folder for the coordinates, together with the visualization method (PCA or UMAP). You can check the available options by running cheese visualize --help
Usage: -c visualize [OPTIONS]
Visualize molecules in 2D from an input file.
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ * --input_file TEXT The input file in one of the following formats : .csv, .sdf, .smi or .txt [default: None] │
│ [required] │
│ --dest TEXT Destination path to save embeddings [default: computed_coordinates] │
│ --sim_name TEXT Similarity type. Can be : 'morgan', 'espsim_shape','espsim_electrostatic', 'active_pairs' │
│ [default: morgan] │
│ --method TEXT Visualization method. Can be : 'umap' or 'pca' [default: umap] │
│ --help Show this message and exit. │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Example
cheese visualize --input_file '/data/myqueries.smi' --dest '/data/mymols_viz' --sim_name 'espsim_shape' --method pca