ligand_scoring
– Ligand scoring functions¶
- class LigandScorer(model, target, model_ligands=None, target_ligands=None, resnum_alignments=False, rename_ligand_chain=False, substructure_match=False, coverage_delta=0.2, max_symmetries=100000.0)¶
Scorer to compute various small molecule ligand (non polymer) scores.
Note
Extra requirements:
Python modules numpy and networkx must be available (e.g. use
pip install numpy networkx
)
LigandScorer
is an abstract base class dealing with all the setup, data storage, enumerating ligand symmetries and target/model ligand matching/assignment. But actual score computation is delegated to child classes.At the moment, two such classes are available:
ost.mol.alg.ligand_scoring_lddtpli.LDDTPLIScorer
that assesses the conservation of protein-ligand contactsost.mol.alg.ligand_scoring_scrmsd.SCRMSDScorer
that computes a binding-site superposed, symmetry-corrected RMSD.
All versus all scores are available through the lazily computed
score_matrix
. However, many things can go wrong… be it even something as simple as two ligands not matching. Error states therefore encode scoring issues. An Issue for a particular ligand is indicated by a non-zero state inmodel_ligand_states
/target_ligand_states
. This invalidates pairwise scores of such a ligand with all other ligands. This and other issues in pairwise score computation are reported instate_matrix
which has the same size asscore_matrix
. Only if the respective location is 0, a valid pairwise score can be expected. The states and their meaning can be explored with code:for state_code, (short_desc, desc) in scorer_obj.state_decoding.items(): print(state_code) print(short_desc) print(desc)
A common use case is to derive a one-to-one mapping between ligands in the model and the target for which
LigandScorer
provides an automated assignment procedure. By default, only exact matches between target and model ligands are considered. This is a problem when the target only contains a subset of the expected atoms (for instance if atoms are missing in an experimental structure, which often happens in the PDB). With substructure_match=True, complete model ligands can be scored against partial target ligands. One problem with this approach is that it is very easy to find good matches to small, irrelevant ligands like EDO, CO2 or GOL. The assignment algorithm therefore considers the coverage, expressed as the fraction of atoms of the model ligand atoms covered in the target. Higher coverage matches are prioritized, but a match with a better score will be preferred if it falls within a window of coverage_delta (by default 0.2) of a worse-scoring match. As a result, for instance, with a delta of 0.2, a low-score match with coverage 0.96 would be preferred over a high-score match with coverage 0.70.Assumptions:
LigandScorer
generally assumes that theis_ligand
property is properly set on all the ligand atoms, and only ligand atoms. This is typically the case for entities loaded from mmCIF (tested with mmCIF files from the PDB and SWISS-MODEL). Legacy PDB files must contain HET headers (which is usually the case for files downloaded from the PDB but not elsewhere).The class doesn’t perform any cleanup of the provided structures. It is up to the caller to ensure that the data is clean and suitable for scoring. Molck should be used with extra care, as many of the options (such as rm_non_std or map_nonstd_res) can cause ligands to be removed from the structure. If cleanup with Molck is needed, ligands should be kept aside and passed separately. Non-ligand residues should be valid compounds with atom names following the naming conventions of the component dictionary. Non-standard residues are acceptable, and if the model contains a standard residue at that position, only atoms with matching names will be considered.
Unlike most of OpenStructure, this class does not assume that the ligands (either in the model or the target) are part of the PDB component dictionary. They may have arbitrary residue names. Residue names do not have to match between the model and the target. Matching is based on the calculation of isomorphisms which depend on the atom element name and atom connectivity (bond order is ignored). It is up to the caller to ensure that the connectivity of atoms is properly set before passing any ligands to this class. Ligands with improper connectivity will lead to bogus results.
Note, however, that atom names should be unique within a residue (ie two distinct atoms cannot have the same atom name).
This only applies to the ligand. The rest of the model and target structures (protein, nucleic acids) must still follow the usual rules and contain only residues from the compound library.
Although it isn’t a requirement, hydrogen atoms should be removed from the structures. Here is an example code snippet that will perform a reasonable cleanup. Keep in mind that this is most likely not going to work as expected with entities loaded from PDB files, as the is_ligand flag is probably not set properly.
Here is an example of how to use setup a scorer code:
from ost.mol.alg.ligand_scoring_scrmsd import SCRMSDScorer from ost.mol.alg import Molck, MolckSettings # Load data # Structure model in PDB format, containing the receptor only model = io.LoadPDB("path_to_model.pdb") # Ligand model as SDF file model_ligand = io.LoadEntity("path_to_ligand.sdf", format="sdf") # Target loaded from mmCIF, containing the ligand target = io.LoadMMCIF("path_to_target.cif") # Cleanup a copy of the structures cleaned_model = model.Copy() cleaned_target = target.Copy() molck_settings = MolckSettings(rm_unk_atoms=True, rm_non_std=False, rm_hyd_atoms=True, rm_oxt_atoms=False, rm_zero_occ_atoms=False, colored=False, map_nonstd_res=False, assign_elem=True) Molck(cleaned_model, conop.GetDefaultLib(), molck_settings) Molck(cleaned_target, conop.GetDefaultLib(), molck_settings) # Setup scorer object and compute lDDT-PLI model_ligands = [model_ligand.Select("ele != H")] sc = SCRMSDScorer(cleaned_model, cleaned_target, model_ligands) # Perform assignment and read respective scores for lig_pair in sc.assignment: trg_lig = sc.target_ligands[lig_pair[0]] mdl_lig = sc.model_ligands[lig_pair[1]] score = sc.score_matrix[lig_pair[0], lig_pair[1]] print(f"Score for {trg_lig} and {mdl_lig}: {score}")
- Parameters:
model (
ost.mol.EntityHandle
/ost.mol.EntityView
) – Model structure - a deep copy is available asmodel
. No additional processing (ie. Molck), checks, stereochemistry checks or sanitization is performed on the input. Hydrogen atoms are kept.target (
ost.mol.EntityHandle
/ost.mol.EntityView
) – Target structure - a deep copy is available astarget
. No additional processing (ie. Molck), checks or sanitization is performed on the input. Hydrogen atoms are kept.model_ligands (
list
) – Model ligands, as a list ofResidueHandle
belonging to the model entity. Can be instantiated with either a :class:list ofResidueHandle
/ost.mol.ResidueView
or ofost.mol.EntityHandle
/ost.mol.EntityView
. If None, ligands will be extracted based on theis_ligand
flag (this is normally set properly in entities loaded from mmCIF).target_ligands (
list
) – Target ligands, as a list ofResidueHandle
belonging to the target entity. Can be instantiated either a :class:list ofResidueHandle
/ost.mol.ResidueView
or ofost.mol.EntityHandle
/ost.mol.EntityView
containing a single residue each. If None, ligands will be extracted based on theis_ligand
flag (this is normally set properly in entities loaded from mmCIF).resnum_alignments (
bool
) – Whether alignments between chemically equivalent chains in model and target can be computed based on residue numbers. This can be assumed in benchmarking setups such as CAMEO/CASP.rename_ligand_chain (
bool
) – If a residue with the same chain name and residue number than an explicitly passed model or target ligand exits in the structure, and rename_ligand_chain is False, a RuntimeError will be raised. If rename_ligand_chain is True, the ligand will be moved to a new chain instead, and the move will be logged to the console with SCRIPT level.substructure_match (
bool
) – Set this to True to allow incomplete (i.e. partially resolved) target ligands.coverage_delta (
float
) – the coverage delta for partial ligand assignment.max_symmetries (
int
) – If more than that many isomorphisms exist for a target-ligand pair, it will be ignored and reported as unassigned.
- property state_matrix¶
Encodes states of ligand pairs
Ligand pairs can be matched and a valid score can be expected if respective location in this matrix is 0. Target ligands are in rows, model ligands in columns. States are encoded as integers <= 9. Larger numbers encode errors for child classes. Use something like
self.state_decoding[3]
to get a decscription.- Return type:
ndarray
- property model_ligand_states¶
Encodes states of model ligands
Non-zero state in any of the model ligands invalidates the full respective column in
state_matrix
.- Return type:
ndarray
- property target_ligand_states¶
Encodes states of target ligands
Non-zero state in any of the target ligands invalidates the full respective row in
state_matrix
.- Return type:
ndarray
- property score_matrix¶
Get the matrix of scores.
Target ligands are in rows, model ligands in columns.
NaN values indicate that no value could be computed (i.e. different ligands). In other words: values are only valid if the respective location in
state_matrix
is 0.- Return type:
ndarray
- property coverage_matrix¶
Get the matrix of model ligand atom coverage in the target.
Target ligands are in rows, model ligands in columns.
NaN values indicate that no value could be computed (i.e. different ligands). In other words: values are only valid if the respective location in
state_matrix
is 0. If substructure_match=False, only full match isomorphisms are considered, and therefore only values of 1.0 can be observed.- Return type:
ndarray
- property aux_matrix¶
Get the matrix of scorer specific auxiliary data.
Target ligands are in rows, model ligands in columns.
Auxiliary data consists of arbitrary data dicts which allow a child class to provide additional information for a scored ligand pair. empty dictionaries indicate that the child class simply didn’t return anything or that no value could be computed (e.g. different ligands). In other words: values are only valid if respective location in the
state_matrix
is 0.- Return type:
ndarray
- property assignment¶
Ligand assignment based on computed scores
Implements a greedy algorithm to assign target and model ligands with each other. Starts from each valid ligand pair as indicated by a state of 0 in
state_matrix
. Each iteration first selects high coverage pairs. Given max_coverage defined as the highest coverage observed in the available pairs, all pairs with coverage in [max_coverage-coverage_delta, max_coverage] are selected. The best scoring pair among those is added to the assignment and the whole process is repeated until there are no ligands to assign anymore.- Return type:
list
oftuple
(trg_lig_idx, mdl_lig_idx)
- property score¶
Get a dictionary of score values, keyed by model ligand
Extract score with something like:
scorer.score[lig.GetChain().GetName()][lig.GetNumber()]
. The returned scores are based onassignment
.- Return type:
dict
- property aux¶
Get a dictionary of score details, keyed by model ligand
Extract dict with something like:
scorer.score[lig.GetChain().GetName()][lig.GetNumber()]
. The returned info dicts are based onassignment
. The content is documented in the respective child class.- Return type:
dict
- property unassigned_target_ligands¶
Get indices of target ligands which are not assigned
- Return type:
list
ofint
- property unassigned_model_ligands¶
Get indices of model ligands which are not assigned
- Return type:
list
ofint
- get_target_ligand_state_report(trg_lig_idx)¶
Get summary of states observed with respect to all model ligands
Mainly for debug purposes
- Parameters:
trg_lig_idx (
int
) – Index of target ligand for which report should be generated
- get_model_ligand_state_report(mdl_lig_idx)¶
Get summary of states observed with respect to all target ligands
Mainly for debug purposes
- Parameters:
mdl_lig_idx (
int
) – Index of model ligand for which report should be generated
- guess_target_ligand_unassigned_reason(trg_lig_idx)¶
Makes an educated guess why target ligand is not assigned
This either returns actual error states or custom states that are derived from them.
- Parameters:
trg_lig_idx (
int
) – Index of target ligand- Returns:
tuple
with two elements: 1) keyword 2) human readable sentence describing the issue, (“unknown”,”unknown”) if nothing obvious can be found.- Raises:
RuntimeError
if specified target ligand is assigned
- guess_model_ligand_unassigned_reason(mdl_lig_idx)¶
Makes an educated guess why model ligand is not assigned
This either returns actual error states or custom states that are derived from them.
- Parameters:
mdl_lig_idx (
int
) – Index of model ligand- Returns:
tuple
with two elements: 1) keyword 2) human readable sentence describing the issue, (“unknown”,”unknown”) if nothing obvious can be found.- Raises:
RuntimeError
if specified model ligand is assigned
- ComputeSymmetries(model_ligand, target_ligand, substructure_match=False, by_atom_index=False, return_symmetries=True, max_symmetries=1000000.0, model_graph=None, target_graph=None)¶
Return a list of symmetries (isomorphisms) of the model onto the target residues.
- Parameters:
model_ligand (
ost.mol.ResidueHandle
orost.mol.ResidueView
) – The model ligandtarget_ligand (
ost.mol.ResidueHandle
orost.mol.ResidueView
) – The target ligandsubstructure_match (
bool
) – Set this to True to allow partial ligands in the reference.by_atom_index (
bool
) – Set this parameter to True if you need the symmetries to refer to atom index (within the residue). Otherwise, if False, the symmetries refer to atom names.max_symmetries (
int
) – If more than that many isomorphisms exist, raise aTooManySymmetriesError
. This can only be assessed by generating at least that many isomorphisms and can take some time.
- Raises:
NoSymmetryError
when no symmetry can be found;NoIsomorphicSymmetryError
in case of isomorphic subgraph but substructure_match is False;TooManySymmetriesError
when more than max_symmetries isomorphisms are found;DisconnectedGraphError
if graph for model_ligand/target_ligand is disconnected.
- exception NoSymmetryError¶
Exception raised when no symmetry can be found.
- exception NoIsomorphicSymmetryError¶
Exception raised when no isomorphic symmetry can be found
There would be isomorphic subgraphs for which symmetries can be found, but substructure_match is disabled
- exception TooManySymmetriesError¶
Exception raised when too many symmetries are found.
- exception DisconnectedGraphError¶
Exception raised when the ligand graph is disconnected.
- class LDDTPLIScorer(model, target, model_ligands=None, target_ligands=None, resnum_alignments=False, rename_ligand_chain=False, substructure_match=False, coverage_delta=0.2, max_symmetries=100000.0, lddt_pli_radius=6.0, add_mdl_contacts=False, lddt_pli_thresholds=[0.5, 1.0, 2.0, 4.0], lddt_pli_binding_site_radius=None)¶
LigandScorer
implementing lDDT-PLI.lDDT-PLI is an lDDT score considering contacts between ligand and receptor. Where receptor consists of protein and nucleic acid chains that pass the criteria for
chain mapping
. This means ignoring other ligands, waters, short polymers as well as any incorrectly connected chains that may be in proximity.LDDTPLIScorer
computes a score for a specific pair of target/model ligands. Given a target/model ligand pair, all possible mappings of model chains onto their chemically equivalent target chains are enumerated. For each of these enumerations, all possible symmetries, i.e. atom-atom assignments of the ligand as given byLigandScorer
, are evaluated and an lDDT-PLI score is computed. The best possible lDDT-PLI score is returned.By default, classic lDDT is computed. That means, contacts within lddt_pli_radius are identified in the target and checked if they’re conserved in the model. Added contacts are not penalized. That means if the ligand is nicely placed in the correct pocket, but that pocket now suddenly interacts with MORE residues in the model, you still get a high score. You can penalize for these added contacts with the add_mdl_contacts flag. This additionally considers contacts within lddt_pli_radius in the model but only if the involved atoms can be mapped to the target. This is a requirement to 1) extract the respective reference distance from the target 2) avoid usage of contacts for which we have no experimental evidence. One special case are contacts from chains that are NOT mapped to the target binding site. It is very well possible that we have experimental evidence for this chain though its just too far away from the target binding site. We therefore try to map these contacts to the chain in the target with equivalent sequence that is closest to the target binding site. If the respective atoms can be mapped there, the contact is considered not fulfilled and added as penalty.
Populates
LigandScorer.aux_data
with followingdict
keys:lddt_pli: The score
lddt_pli_n_contacts: Number of contacts considered in lDDT computation
target_ligand: The actual target ligand for which the score was computed
model_ligand: The actual model ligand for which the score was computed
bs_ref_res:
set
of residues with potentially non-zero contribution to score. That is every residue with at least one atom within lddt_pli_radius + max(lddt_pli_thresholds) of the ligand.bs_mdl_res: Same for model
- Parameters:
model (
ost.mol.EntityHandle
/ost.mol.EntityView
) – Passed to parent constructor - seeLigandScorer
.target (
ost.mol.EntityHandle
/ost.mol.EntityView
) – Passed to parent constructor - seeLigandScorer
.model_ligands (
list
) – Passed to parent constructor - seeLigandScorer
.target_ligands (
list
) – Passed to parent constructor - seeLigandScorer
.resnum_alignments (
bool
) – Passed to parent constructor - seeLigandScorer
.rename_ligand_chain (
bool
) – Passed to parent constructor - seeLigandScorer
.substructure_match (
bool
) – Passed to parent constructor - seeLigandScorer
.coverage_delta (
float
) – Passed to parent constructor - seeLigandScorer
.max_symmetries (
int
) – Passed to parent constructor - seeLigandScorer
.lddt_pli_radius (
float
) – lDDT inclusion radius for lDDT-PLI.add_mdl_contacts (
bool
) – Whether to add mdl contacts.lddt_pli_thresholds (
list
offloat
) – Distance difference thresholds for lDDT.lddt_pli_binding_site_radius (
float
) – Pro param - dont use. Providing a value Restores behaviour from previous implementation that first extracted a binding site with strict distance threshold and computed lDDT-PLI only on those target residues whereas the current implementation includes every atom within lddt_pli_radius.
- class SCRMSDScorer(model, target, model_ligands=None, target_ligands=None, resnum_alignments=False, rename_ligand_chain=False, substructure_match=False, coverage_delta=0.2, max_symmetries=100000.0, bs_radius=4.0, lddt_lp_radius=15.0, model_bs_radius=25, binding_sites_topn=100000, full_bs_search=False)¶
LigandScorer
implementing symmetry corrected RMSD.SCRMSDScorer
computes a score for a specific pair of target/model ligands.The returned RMSD is based on a binding site superposition. The binding site of the target structure is defined as all residues with at least one atom within bs_radius around the target ligand. It only contains protein and nucleic acid residues from chains that pass the criteria for the
chain mapping
. This means ignoring other ligands, waters, short polymers as well as any incorrectly connected chains that may be in proximity. The respective model binding site for superposition is identified by naively enumerating all possible mappings of model chains onto their chemically equivalent target counterparts from the target binding site. The binding_sites_topn with respect to lDDT score are evaluated and an RMSD is computed. You can either try to map ALL model chains onto the target binding site by enabling full_bs_search or restrict the model chains for a specific target/model ligand pair to the chains with at least one atom within model_bs_radius around the model ligand. The latter can be significantly faster in case of large complexes. Symmetry correction is achieved by simply computing an RMSD value for each symmetry, i.e. atom-atom assignments of the ligand as given byLigandScorer
. The lowest RMSD value is returnedPopulates
LigandScorer.aux_data
with followingdict
keys:rmsd: The score
lddt_lp: lDDT of the binding site used for superposition
bs_ref_res:
list
of binding site residues in targetbs_ref_res_mapped:
list
of target binding site residues that are mapped to modelbs_mdl_res_mapped:
list
of same length with respective model residuesbb_rmsd: Backbone RMSD (CA, C3’ for nucleotides) for mapped residues given transform
target_ligand: The actual target ligand for which the score was computed
model_ligand: The actual model ligand for which the score was computed
chain_mapping:
dict
with a chain mapping of chains involved in binding site - key: trg chain name, value: mdl chain nametransform:
geom.Mat4
to transform model binding site onto target binding siteinconsistent_residues:
list
oftuple
representing residues with inconsistent residue names upon mapping (which is given by bs_ref_res_mapped and bs_mdl_res_mapped). Tuples have two elements: 1) trg residue 2) mdl residue
- Parameters:
model (
ost.mol.EntityHandle
/ost.mol.EntityView
) – Passed to parent constructor - seeLigandScorer
.target (
ost.mol.EntityHandle
/ost.mol.EntityView
) – Passed to parent constructor - seeLigandScorer
.model_ligands (
list
) – Passed to parent constructor - seeLigandScorer
.target_ligands (
list
) – Passed to parent constructor - seeLigandScorer
.resnum_alignments (
bool
) – Passed to parent constructor - seeLigandScorer
.rename_ligand_chain (
bool
) – Passed to parent constructor - seeLigandScorer
.substructure_match (
bool
) – Passed to parent constructor - seeLigandScorer
.coverage_delta (
float
) – Passed to parent constructor - seeLigandScorer
.max_symmetries (
int
) – Passed to parent constructor - seeLigandScorer
.bs_radius (
float
) – Inclusion radius for the binding site. Residues with atoms within this distance of the ligand will be considered for inclusion in the binding site.lddt_lp_radius (
float
) – lDDT inclusion radius for lDDT-LP.model_bs_radius (
float
) – inclusion radius for model binding sites. Only used when full_bs_search=False, otherwise the radius is effectively infinite. Only chains with atoms within this distance of a model ligand will be considered in the chain mapping.binding_sites_topn (
int
) – maximum number of model binding site representations to assess per target binding site.full_bs_search (
bool
) – If True, all potential binding sites in the model are searched for each target binding site. If False, the search space in the model is reduced to chains around (model_bs_radius Å) model ligands. This speeds up computations, but may result in ligands not being scored if the predicted ligand pose is too far from the actual binding site.
- SCRMSD(model_ligand, target_ligand, transformation=geom.Mat4(1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1), substructure_match=False, max_symmetries=1000000.0)¶
Calculate symmetry-corrected RMSD.
Binding site superposition must be computed separately and passed as transformation.
- Parameters:
model_ligand (
ost.mol.ResidueHandle
orost.mol.ResidueView
) – The model ligandtarget_ligand (
ost.mol.ResidueHandle
orost.mol.ResidueView
) – The target ligandtransformation (
ost.geom.Mat4
) – Optional transformation to apply on each atom position of model_ligand.substructure_match (
bool
) – Set this to True to allow partial target ligand.max_symmetries (
int
) – If more than that many isomorphisms exist, raise aTooManySymmetriesError
. This can only be assessed by generating at least that many isomorphisms and can take some time.
- Return type:
float
- Raises:
ost.mol.alg.ligand_scoring_base.NoSymmetryError
when no symmetry can be found,ost.mol.alg.ligand_scoring_base.DisconnectedGraphError
when ligand graph is disconnected,ost.mol.alg.ligand_scoring_base.TooManySymmetriesError
when more than max_symmetries isomorphisms are found.