CHEESE CLI

Once CHEESE is installed, on-prem users have access to a CLI tool. To test that the installation works, run cheese to display the available commands.

 Usage: -c [OPTIONS] COMMAND [ARGS]...                                                                                         
                                                                                                                               
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --help          Show this message and exit.                                                                                 │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Commands ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ embeddings             Run CHEESE embeddings CPU computation on an input file.                                              │
│ embeddings_gpu         Run CHEESE embeddings GPU computation for an input file.                                             │
│ generate_license_key   Generate a license key for CHEESE                                                                    │
│ inference              Run CHEESE Inference for an input file.'                                                             │
│ search                 Run CHEESE Search on a file of your choice, and save the search outputs to an output file.           │
│ start_app              Start the CHEESE APP                                                                                 │
│ update                 Get the latest CHEESE version                                                                        │
│ visualize              Visualize molecules in 2D from an input file.                                                        │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
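
The same check can be scripted, for example as part of a deployment test. The following is a minimal sketch, assuming the executable is named cheese and that --help exits with status 0 (the usual behaviour for this style of CLI, but not confirmed here); the helper name cheese_available is hypothetical.

```python
import shutil
import subprocess


def cheese_available(executable="cheese"):
    """Return True if the executable is on PATH and `--help` exits with code 0."""
    path = shutil.which(executable)
    if path is None:
        return False
    try:
        result = subprocess.run(
            [path, "--help"], capture_output=True, text=True, timeout=60
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # Assumption: a successful help call returns exit code 0.
    return result.returncode == 0


if __name__ == "__main__":
    print("cheese CLI found" if cheese_available() else "cheese CLI not found")
```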

Updating CHEESE

To get the latest CHEESE version, run the command cheese update.

CHEESE license file

  1. Run the command cheese generate_license_key to generate a license key. Please note that the license key is environment-specific, i.e., you will need another license file if you want to run CHEESE on another host machine.
  2. Copy the license key and send it to us.
  3. We will send you a JSON license file. Save it at the path defined in the LICENSE_FILE environment variable during the installation.
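
Before starting CHEESE, it can help to confirm that LICENSE_FILE points to a readable JSON file. The sketch below only checks that the variable is set, the file exists and the content parses as JSON; it makes no assumption about the fields inside the license, and the helper name load_license is hypothetical.

```python
import json
import os


def load_license(env=None):
    """Return the parsed license JSON, or raise with a readable message."""
    env = os.environ if env is None else env
    path = env.get("LICENSE_FILE")
    if not path:
        raise RuntimeError("LICENSE_FILE is not set")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"LICENSE_FILE points to a missing file: {path}")
    with open(path, encoding="utf-8") as handle:
        # json.load raises json.JSONDecodeError if the file is not valid JSON.
        return json.load(handle)
```

Calling load_license() with no argument reads the current process environment.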

Input and Output File Format

In commands that require an input file, each line of the file should contain a molecule in SMILES format and its ID, in the following format: smiles,id. Here is an example of an input CSV file.

smiles,id
C[C@H](NC(=O)N1CC2(CCC2)C1c1ccc(F)cc1)C1CC1,Z5348285396
CC(NC(=O)N1CC2(CCC2)C1c1ccc(F)cc1)C1CC1,Z5348285396
C[C@@H](NC(=O)N1CC2(CCC2)C1c1ccc(F)cc1)C1CC1,Z5348285396

The .smi, .sdf and .txt file formats are supported as well.
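
If your molecules live in another system, a CSV file in this layout can be produced with the Python standard library. This is a minimal sketch; the function name write_cheese_input and the example IDs are hypothetical.

```python
import csv


def write_cheese_input(path, molecules):
    """Write (smiles, id) pairs to a CSV file with the header 'smiles,id'."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        writer.writerow(["smiles", "id"])
        for smiles, mol_id in molecules:
            writer.writerow([smiles, mol_id])


if __name__ == "__main__":
    # Hypothetical molecules and IDs, for illustration only.
    write_cheese_input("mydb.csv", [("CCO", "mol-1"), ("c1ccccc1", "mol-2")])
```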

Output files (CHEESE Embeddings) are saved in .npy or .parquet format. .npy is the binary array format of the Python library NumPy, and .parquet is the columnar storage format of Apache Parquet. Our API can also return JSON values or a CSV; however, for larger numbers of molecules we strongly recommend the .npy or .parquet formats.

SMILES Standardization

Our tools expect the input SMILES to be in a canonicalized, RDKit-compatible format, neutralized if possible. In inference, there is an optional canonicalization step that can be enabled with the --canonicalize_smiles flag. We recommend a standardization function like the following one, which was used during CHEESE model training. In casual applications the standardization step can be skipped, but it is always better to have the input in a consistent, standardized format.

from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def standardize(smiles):
    """
    follows the steps in https://github.com/greglandrum/RSC_OpenScience_Standardization_202104/blob/main/MolStandardize%20pieces.ipynb
    as described **excellently** (by Greg) in https://www.youtube.com/watch?v=eWTApNX8dJQ
    Source: https://bitsilla.com/blog/2021/06/standardizing-a-molecule-using-rdkit/
    """
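    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None when the SMILES cannot be parsed
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
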
    # removeHs, disconnect metal atoms, normalize the molecule, reionize the molecule
    clean_mol = rdMolStandardize.Cleanup(mol) 
     
    # if many fragments, get the "parent" (the actual mol we are interested in) 
    parent_clean_mol = rdMolStandardize.FragmentParent(clean_mol)
         
    # try to neutralize molecule
    uncharger = rdMolStandardize.Uncharger() # annoying, but necessary as no convenience method exists
    uncharged_parent_clean_mol = uncharger.uncharge(parent_clean_mol)

    te = rdMolStandardize.TautomerEnumerator() # idem
    taut_uncharged_parent_clean_mol = te.Canonicalize(uncharged_parent_clean_mol)
     
    return Chem.MolToSmiles(taut_uncharged_parent_clean_mol)
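
When standardizing a whole database, a single unparsable SMILES should not abort the run. The sketch below applies a standardization function to (smiles, id) pairs and collects the failures separately. It takes the function as an argument so that it does not depend on RDKit itself; the name standardize_rows is hypothetical.

```python
def standardize_rows(rows, standardize_fn):
    """Apply standardize_fn to each (smiles, id) pair.

    Returns (ok, failed): ok holds (standardized_smiles, id) pairs and
    failed holds the original (smiles, id) pairs whose standardization
    raised an exception.
    """
    ok, failed = [], []
    for smiles, mol_id in rows:
        try:
            ok.append((standardize_fn(smiles), mol_id))
        except Exception:
            # Keep the original row so it can be inspected or fixed later.
            failed.append((smiles, mol_id))
    return ok, failed
```

With the function above, a call would look like standardize_rows(rows, standardize).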

Index Inference

Required for Indexing Custom Database

Users of CHEESE Search who wish to index their own database and search it in the UI, API or CLI must first run an inference step. This step generates the embeddings and search indexes for the molecules in your database.

The CLI tool supports running CHEESE inference on your custom database with the command cheese inference. You can check the available options by running cheese inference --help

 Usage: -c inference [OPTIONS]                                                                                         
                                                                                                                       
 Run CHEESE Inference for an input file.'                                                                              
                                                                                                                       
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *  --input_file                                         TEXT  The input file in CSV format. Please provide a CSV    │
│                                                               file in the following format : 'SMILES,ID             │
│                                                               [default: None]                                       │
│                                                               [required]                                            │
│    --dest                                               TEXT  Destination folder where to save the results. Will be │
│                                                               inside your source folder                             │
│                                                               [default: output]                                     │
│    --index_type                                         TEXT  Index type : clustered, in_memory, auto               │
│                                                               [default: auto]                                       │
│    --gpu_devices                                        TEXT  List of GPU devices on which to run computation : e.g │
│                                                               '0,3,2'                                               │
│                                                               [default: 0]                                          │
│    --validate_smiles        --no-validate_smiles              Whether to validate the SMILES of the input file      │
│                                                               [default: no-validate_smiles]                         │
│    --canonicalize_smiles    --no-canonicalize_smiles          Whether to canonicalize the SMILES of the input file  │
│                                                               [default: no-canonicalize_smiles]                     │
│    --help                                                     Show this message and exit.                           │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Please note that the index type is chosen automatically by default: if the input file exceeds 1 GB in size, the script runs the clustered inference; otherwise it runs the in_memory inference.
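
To predict which index type the auto setting will select for a given file, the size rule can be reproduced as follows. This is a sketch of the rule as described, not the CLI's own code; in particular, treating 1 GB as 1024^3 bytes is an assumption, and the function name pick_index_type is hypothetical.

```python
import os

# Assumption: "1 GB" is taken as 1024**3 bytes; the CLI may use 10**9 instead.
ONE_GB = 1024 ** 3


def pick_index_type(input_file, threshold_bytes=ONE_GB):
    """Return 'clustered' for files above the threshold, else 'in_memory'."""
    size = os.path.getsize(input_file)
    return "clustered" if size > threshold_bytes else "in_memory"
```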

Example

cheese inference --input_file '/data/mydb.csv' --dest /path/to/my_output --index_type in_memory
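
When inference is launched from a pipeline rather than by hand, the command can be assembled in Python. The sketch below uses only the options listed in the help output above; the helper name build_inference_command is hypothetical.

```python
import subprocess


def build_inference_command(input_file, dest="output", index_type="auto",
                            gpu_devices=(0,), validate_smiles=False,
                            canonicalize_smiles=False):
    """Build the argument list for `cheese inference`."""
    if index_type not in ("clustered", "in_memory", "auto"):
        raise ValueError(f"Unknown index type: {index_type}")
    return [
        "cheese", "inference",
        "--input_file", str(input_file),
        "--dest", str(dest),
        "--index_type", index_type,
        "--gpu_devices", ",".join(str(d) for d in gpu_devices),
        "--validate_smiles" if validate_smiles else "--no-validate_smiles",
        "--canonicalize_smiles" if canonicalize_smiles
        else "--no-canonicalize_smiles",
    ]


def run_inference(**kwargs):
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_inference_command(**kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.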

Embeddings Computation

CHEESE CLI supports large-scale embedding computation on CPU or GPU using CHEESE models, with the commands cheese embeddings and cheese embeddings_gpu. You can supply an input file of molecules, a destination folder to save the embeddings, and the search type. You can check the available options by running cheese embeddings_gpu --help

cheese embeddings_gpu --help
                                                                                
 Usage: -c embeddings_gpu [OPTIONS]                                             
                                                                                
 Run CHEESE embeddings GPU computation for an input file.                       
                                                                                
╭─ Options ────────────────────────────────────────────────────────────────────╮
│ *  --input_file         TEXT  The input file in CSV format. Please provide a │
│                               CSV file in the following format : 'SMILES,ID  │
│                               [default: None]                                │
│                               [required]                                     │
│    --search_type        TEXT  Type of embeddings : morgan, espsim_shape,     │
│                               espsim_electrostatic, active_pairs, all        │
│                               [default: all]                                 │
│    --gpu_devices        TEXT  List of GPU devices on which to run            │
│                               computation : e.g '0,3,2'                      │
│                               [default: 0]                                   │
│    --save_format        TEXT  Save format of the embeddings. Can be          │
│                               'parquet' or 'numpy'                           │
│                               [default: numpy]                               │
│    --dest               TEXT  Destination folder. Will be inside your source │
│                               folder.                                        │
│                               [default: computed_embeddings]                 │
│    --help                     Show this message and exit.                    │
╰──────────────────────────────────────────────────────────────────────────────╯
Example

cheese embeddings_gpu --input_file '/data/mydb.smi' --dest /data/my_embeddings --search_type active_pairs

CHEESE Search

CHEESE CLI supports searching your available databases with the command cheese search. You can supply an input file of molecules and an output CSV file to save the search results, together with other search parameters. You can check the available options by running cheese search --help

 Usage: -c search [OPTIONS]                                                                                                    
                                                                                                                               
 Run CHEESE Search on a file of your choice, and save the search outputs to an output file.                                    
                                                                                                                               
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *  --input_file            TEXT     The input file in one of the following formats : .csv, .sdf, .smi or .txt               │
│                                     [default: None]                                                                         │
│                                     [required]                                                                              │
│ *  --output_file           TEXT     The output file in CSV format [default: None] [required]                                │
│    --db_names              TEXT     Databases to search in separated by ','. e.g 'ENAMINE-REAL,ZINC15'                      │
│                                     [default: ENAMINE-REAL]                                                                 │
│    --search_type           TEXT     Search type. Can be : 'morgan', 'espsim_shape','espsim_electrostatic', 'active_pairs'   │
│                                     [default: morgan]                                                                       │
│    --search_quality        TEXT     Search quality. Can be : 'fast', 'accurate','very accurate' [default: fast]             │
│    --n_neighbors           INTEGER  Number of results to retrieve. [default: 30]                                            │
│    --help                           Show this message and exit.                                                             │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Example

cheese search --input_file '/data/myqueries.smi' --output_file '/data/results.csv' --db_names 'ZINC15,CUSTOM_DB' --search_type morgan --search_quality accurate --n_neighbors 100
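
A search over a large query file can take a while, so it is worth rejecting a mistyped option before launching it. The following sketch checks the search type and quality against the values listed in the help output above and then builds the command; the helper name build_search_command is hypothetical, and the validation is a convenience added here, not something the CLI is known to require.

```python
SEARCH_TYPES = ("morgan", "espsim_shape", "espsim_electrostatic", "active_pairs")
SEARCH_QUALITIES = ("fast", "accurate", "very accurate")


def build_search_command(input_file, output_file, db_names=("ENAMINE-REAL",),
                         search_type="morgan", search_quality="fast",
                         n_neighbors=30):
    """Validate the options and build the argument list for `cheese search`."""
    if search_type not in SEARCH_TYPES:
        raise ValueError(f"search_type must be one of {SEARCH_TYPES}")
    if search_quality not in SEARCH_QUALITIES:
        raise ValueError(f"search_quality must be one of {SEARCH_QUALITIES}")
    if int(n_neighbors) < 1:
        raise ValueError("n_neighbors must be a positive integer")
    return [
        "cheese", "search",
        "--input_file", str(input_file),
        "--output_file", str(output_file),
        "--db_names", ",".join(db_names),
        "--search_type", search_type,
        "--search_quality", search_quality,
        "--n_neighbors", str(int(n_neighbors)),
    ]
```

The returned list can be passed to subprocess.run(..., check=True) to execute the search.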

CHEESE Visualization

CHEESE CLI supports visualizing molecules in 2D with the command cheese visualize. You can supply an input file of molecules and a destination folder to save the coordinates, together with the visualization method (PCA or UMAP). You can check the available options by running cheese visualize --help

 Usage: -c visualize [OPTIONS]                                                                                                 
                                                                                                                               
 Visualize molecules in 2D from an input file.                                                                                 
                                                                                                                               
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *  --input_file        TEXT  The input file in one of the following formats : .csv, .sdf, .smi or .txt [default: None]      │
│                              [required]                                                                                     │
│    --dest              TEXT  Destination path to save embeddings [default: computed_coordinates]                            │
│    --sim_name          TEXT  Similarity type. Can be : 'morgan', 'espsim_shape','espsim_electrostatic', 'active_pairs'      │
│                              [default: morgan]                                                                              │
│    --method            TEXT  Visualization method. Can be : 'umap' or 'pca' [default: umap]                                 │
│    --help                    Show this message and exit.                                                                    │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Example

cheese visualize --input_file '/data/myqueries.smi' --dest '/data/mymols_viz' --sim_name 'espsim_shape' --method pca