chain_mapping – Chain Mapping

Chain mapping aims to identify a one-to-one relationship between chains in a reference structure and a model.

class ChainMapper(target, resnum_alignments=False, pep_seqid_thr=95.0, pep_gap_thr=1.0, nuc_seqid_thr=95.0, nuc_gap_thr=1.0, pep_subst_mat=<ost.seq.alg._ost_seq_alg.SubstWeightMatrix object>, pep_gap_open=-11, pep_gap_ext=-1, nuc_subst_mat=<ost.seq.alg._ost_seq_alg.SubstWeightMatrix object>, nuc_gap_open=-4, nuc_gap_ext=-4, min_pep_length=6, min_nuc_length=4, n_max_naive=100000000.0)

Class to compute chain mappings

All algorithms are performed on processed structures which fulfill criteria as given in constructor arguments (min_pep_length, “min_nuc_length”) and only contain residues which have all required backbone atoms. for peptide residues thats N, CA, C and CB (no CB for GLY), for nucleotide residues thats O5’, C5’, C4’, C3’ and O3’.

Chain mapping is a three step process:

  • Group chemically identical chains in target using pairwise alignments that are either computed with Needleman-Wunsch (NW) or simply derived from residue numbers (resnum_alignments flag). In case of NW, pep_subst_mat, pep_gap_open and pep_gap_ext and their nucleotide equivalents are relevant. Two chains are considered identical if they fulfill the thresholds given by pep_seqid_thr, pep_gap_thr, their nucleotide equivalents respectively. The grouping information is available as attributes of this class.

  • Map chains in an input model to these groups. Generating alignments and the similarity criteria are the same as above. You can either get the group mapping with GetChemMapping() or directly call one of the full fletched one-to-one chain mapping functions which execute that step internally.

  • Obtain one-to-one mapping for chains in an input model and target with one of the available mapping functions. Just to get an idea of complexity. If target and model are octamers, there are 8! = 40320 possible chain mappings.

Parameters:
  • target (ost.mol.EntityView/ost.mol.EntityHandle) – Target structure onto which models are mapped. Computations happen on a selection only containing polypeptides and polynucleotides.

  • resnum_alignments (bool) – Use residue numbers instead of Needleman-Wunsch to compute pairwise alignments. Relevant for chem_groups and related attributes.

  • pep_seqid_thr (float) – Threshold used to decide when two chains are identical. 95 percent tolerates the few mutations crystallographers like to do.

  • pep_gap_thr (float) – Additional threshold to avoid gappy alignments with high seqid. By default this is disabled (set to 1.0). This threshold checks for a maximum allowed fraction of gaps in any of the two sequences after stripping terminal gaps. The reason for not just normalizing seqid by the longer sequence is that one sequence might be a perfect subsequence of the other but only cover half of it.

  • nuc_seqid_thr (float) – Nucleotide equivalent for pep_seqid_thr

  • nuc_gap_thr (float) – Nucleotide equivalent for nuc_gap_thr

  • pep_subst_mat (ost.seq.alg.SubstWeightMatrix) – Substitution matrix to align peptide sequences, irrelevant if resnum_alignments is True, defaults to seq.alg.BLOSUM62

  • pep_gap_open (int) – Gap open penalty to align peptide sequences, irrelevant if resnum_alignments is True

  • pep_gap_ext (int) – Gap extension penalty to align peptide sequences, irrelevant if resnum_alignments is True

  • nuc_subst_mat (ost.seq.alg.SubstWeightMatrix) – Nucleotide equivalent for pep_subst_mat, defaults to seq.alg.NUC44

  • nuc_gap_open (int) – Nucleotide equivalent for pep_gap_open

  • nuc_gap_ext (int) – Nucleotide equivalent for pep_gap_ext

  • min_pep_length (int) – Minimal number of residues for a peptide chain to be considered in target and in models.

  • min_nuc_length (int) – Minimal number of residues for a nucleotide chain to be considered in target and in models.

  • n_max_naive (int) – Max possible chain mappings that are enumerated in GetNaivelDDTMapping() / GetDecomposerlDDTMapping(). A RuntimeError is raised in case of bigger complexity.

property target

Target structure that only contains peptides/nucleotides

Contains only residues that have the backbone representatives (CA for peptide and C3’ for nucleotides) to avoid ATOMSEQ alignment inconsistencies when switching between all atom and backbone only representations.

Type:

ost.mol.EntityView

property polypep_seqs

Sequences of peptide chains in target

Respective EntityView from target for each sequence s are available as s.GetAttachedView()

Type:

ost.seq.SequenceList

property polynuc_seqs

Sequences of nucleotide chains in target

Respective EntityView from target for each sequence s are available as s.GetAttachedView()

Type:

ost.seq.SequenceList

property chem_groups

Groups of chemically equivalent chains in target

First chain in group is the one with longest sequence.

Getter:

Computed on first use (cached)

Type:

list of list of str (chain names)

property chem_group_alignments

MSA for each group in chem_groups

Sequences in MSAs exhibit same order as in chem_groups and have the respective ost.mol.EntityView from target attached.

Getter:

Computed on first use (cached)

Type:

ost.seq.AlignmentList

property chem_group_ref_seqs

Reference (longest) sequence for each group in chem_groups

Respective EntityView from target for each sequence s are available as s.GetAttachedView()

Getter:

Computed on first use (cached)

Type:

ost.seq.SequenceList

property chem_group_types

ChemType of each group in chem_groups

Specifying if groups are poly-peptides/nucleotides, i.e. ost.mol.ChemType.AMINOACIDS or ost.mol.ChemType.NUCLEOTIDES

Getter:

Computed on first use (cached)

Type:

list of ost.mol.ChemType

GetChemMapping(model)

Maps sequences in model to chem_groups of target

Parameters:

model (ost.mol.EntityView/ost.mol.EntityHandle) – Model from which to extract sequences, a selection that only includes peptides and nucleotides is performed and returned along other results.

Returns:

Tuple with two lists of length len(self.chem_groups) and an ost.mol.EntityView representing model: 1) Each element is a list with mdl chain names that map to the chem group at that position. 2) Each element is a ost.seq.AlignmentList aligning these mdl chain sequences to the chem group ref sequences. 3) A selection of model that only contains polypeptides and polynucleotides whose ATOMSEQ exactly matches the sequence info in the returned alignments.

GetlDDTMapping(model, inclusion_radius=15.0, thresholds=[0.5, 1.0, 2.0, 4.0], strategy='heuristic', steep_opt_rate=None, block_seed_size=5, block_blocks_per_chem_group=5, chem_mapping_result=None, heuristic_n_max_naive=40320)

Identify chain mapping by optimizing lDDT score

Maps model chain sequences to chem_groups and find mapping based on backbone only lDDT score (CA for amino acids C3’ for Nucleotides).

Either performs a naive search, i.e. enumerate all possible mappings or executes a greedy strategy that tries to identify a (close to) optimal mapping in an iterative way by starting from a start mapping (seed). In each iteration, the one-to-one mapping that leads to highest increase in number of conserved contacts is added with the additional requirement that this added mapping must have non-zero interface counts towards the already mapped chains. So basically we’re “growing” the mapped structure by only adding connected stuff.

The available strategies:

  • naive: Enumerates all possible mappings and returns best

  • greedy_fast: perform all vs. all single chain lDDTs within the respective ref/mdl chem groups. The mapping with highest number of conserved contacts is selected as seed for greedy extension

  • greedy_full: try multiple seeds for greedy extension, i.e. try all ref/mdl chain combinations within the respective chem groups and retain the mapping leading to the best lDDT.

  • greedy_block: try multiple seeds for greedy extension, i.e. try all ref/mdl chain combinations within the respective chem groups and extend them to block_seed_size. block_blocks_per_chem_group for each chem group are selected for exhaustive extension.

  • heuristic: Uses naive strategy if number of possible mappings is within heuristic_n_max_naive. The default of 40320 corresponds to an octamer (8!=40320). A structure with stoichiometry A6B2 would be 6!*2!=1440 etc. If the number of possible mappings is larger, greedy_full is used.

Sets MappingResult.opt_score in case of no trivial one-to-one mapping.

Parameters:
  • model (ost.mol.EntityView/ost.mol.EntityHandle) – Model to map

  • inclusion_radius (float) – Inclusion radius for lDDT

  • thresholds (list of float) – Thresholds for lDDT

  • strategy (str) – Strategy to find mapping. Must be in [“naive”, “greedy_fast”, “greedy_full”, “greedy_block”]

  • steep_opt_rate (int) – Only relevant for greedy strategies. If set, every steep_opt_rate mappings, a simple optimization is executed with the goal of avoiding local minima. The optimization iteratively checks all possible swaps of mappings within their respective chem groups and accepts swaps that improve lDDT score. Iteration stops as soon as no improvement can be achieved anymore.

  • block_seed_size (int) – Param for greedy_block strategy - Initial seeds are extended by that number of chains.

  • block_blocks_per_chem_group (int) – Param for greedy_block strategy - Number of blocks per chem group that are extended in an initial search for high scoring local solutions.

  • chem_mapping_result (tuple) – Pro param. The result of GetChemMapping() where you provided model. If set, model parameter is not used.

Returns:

A MappingResult

GetQSScoreMapping(model, contact_d=12.0, strategy='heuristic', block_seed_size=5, block_blocks_per_chem_group=5, heuristic_n_max_naive=40320, steep_opt_rate=None, chem_mapping_result=None, greedy_prune_contact_map=True)

Identify chain mapping based on QSScore

Scoring is based on CA/C3’ positions which are present in all chains of a chem_groups as well as the model chains which are mapped to that respective chem group.

The following strategies are available:

  • naive: Naively iterate all possible mappings and return best based

    on QS score.

  • greedy_fast: perform all vs. all single chain lDDTs within the respective ref/mdl chem groups. The mapping with highest number of conserved contacts is selected as seed for greedy extension. Extension is based on QS-score.

  • greedy_full: try multiple seeds for greedy extension, i.e. try all ref/mdl chain combinations within the respective chem groups and retain the mapping leading to the best QS-score.

  • greedy_block: try multiple seeds for greedy extension, i.e. try all ref/mdl chain combinations within the respective chem groups and extend them to block_seed_size. block_blocks_per_chem_group for each chem group are selected for exhaustive extension.

  • heuristic: Uses naive strategy if number of possible mappings is within heuristic_n_max_naive. The default of 40320 corresponds to an octamer (8!=40320). A structure with stoichiometry A6B2 would be 6!*2!=1440 etc. If the number of possible mappings is larger, greedy_full is used.

Sets MappingResult.opt_score in case of no trivial one-to-one mapping.

Parameters:
  • model (ost.mol.EntityView/ost.mol.EntityHandle) – Model to map

  • contact_d (float) – Max distance between two residues to be considered as contact in qs scoring

  • strategy (str) – Strategy for sampling, must be in [“naive”, “greedy_fast”, “greedy_full”, “greedy_block”]

  • chem_mapping_result (tuple) – Pro param. The result of GetChemMapping() where you provided model. If set, model parameter is not used.

  • greedy_prune_contact_map – Relevant for all strategies that use greedy extensions. If True, only chains with at least 3 contacts (8A CB distance) towards already mapped chains in trg/mdl are considered for extension. All chains that give a potential non-zero QS-score increase are used otherwise (at least one contact within 12A). The consequence is reduced runtime and usually no real reduction in accuracy.

Returns:

A MappingResult

GetRMSDMapping(model, strategy='heuristic', subsampling=50, chem_mapping_result=None, heuristic_n_max_naive=120)

Identify chain mapping based on minimal RMSD superposition

Superposition and scoring is based on CA/C3’ positions which are present in all chains of a chem_groups as well as the model chains which are mapped to that respective chem group.

The following strategies are available:

  • naive: Naively iterate all possible mappings and return the one with lowes RMSD.

  • greedy_single: perform all vs. all single chain superpositions within the respective ref/mdl chem groups to use as starting points. For each starting point, iteratively add the model/target chain pair with lowest RMSD until a full mapping is achieved. The mapping with lowest RMSD is returned.

  • greedy_iterative: Same as greedy_single_rmsd exept that the transformation gets updated with each added chain pair.

  • heuristic: Uses naive strategy if number of possible mappings is within heuristic_n_max_naive. The default of 120 corresponds to a homo-pentamer (5!=120). If the number of possible mappings is larger, greedy_iterative is used.

Parameters:
  • model (ost.mol.EntityView/ost.mol.EntityHandle) – Model to map

  • strategy (str) – Strategy for sampling. Must be in [“naive”, “greedy_single”, “greedy_iterative”]

  • subsampling (int) – If given, only an equally distributed subset of CA/C3’ positions in each chain are used for superposition/scoring.

  • chem_mapping_result (tuple) – Pro param. The result of GetChemMapping() where you provided model. If set, model parameter is not used.

Returns:

A MappingResult

GetMapping(model, n_max_naive=40320)

Convenience function to get mapping with currently preferred method

If number of possible chain mappings is <= n_max_naive, a naive QS-score mapping is performed and optimal QS-score is guaranteed. For anything else, a QS-score mapping with the greedy_full strategy is performed (greedy_prune_contact_map = True). The default for n_max_naive of 40320 corresponds to an octamer (8!=40320). A structure with stoichiometry A6B2 would be 6!*2!=1440 etc.

If target has nucleotide sequences, the QS-score target function is replaced with a backbone only lDDT score that has an inclusion radius of 30A.

GetRepr(substructure, model, topn=1, inclusion_radius=15.0, thresholds=[0.5, 1.0, 2.0, 4.0], bb_only=False, only_interchain=False, chem_mapping_result=None, global_mapping=None)

Identify topn representations of substructure in model

substructure defines a subset of target for which one wants the topn representations in model. Representations are scored and sorted by lDDT.

Parameters:
  • substructure (ost.mol.EntityView) – A ost.mol.EntityView which is a subset of target. Should be selected with the OpenStructure query language. Example: if you’re interested in residues with number 42,43 and 85 in chain A: substructure=mapper.target.Select("cname=A and rnum=42,43,85") A RuntimeError is raised if substructure does not refer to the same underlying ost.mol.EntityHandle as target.

  • model (ost.mol.EntityView/ost.mol.EntityHandle) – Structure in which one wants to find representations for substructure

  • topn (int) – Max number of representations that are returned

  • inclusion_radius (float) – Inclusion radius for lDDT

  • thresholds (list of float) – Thresholds for lDDT

  • bb_only (bool) – Only consider backbone atoms in lDDT computation

  • only_interchain (bool) – Only score interchain contacts in lDDT. Useful if you want to identify interface patches.

  • chem_mapping_result (tuple) – Pro param. The result of GetChemMapping() where you provided model. If set, model parameter is not used.

  • global_mapping (MappingResult) – Pro param. Specify a global mapping result. This fully defines the desired representation in the model but extracts it and enriches it with all the nice attributes of ReprResult. The target attribute in global_mapping must be of the same entity as self.target and the model attribute of global_mapping must be of the same entity as model.

Returns:

list of ReprResult

GetNMappings(model)

Returns number of possible mappings

Parameters:

model (ost.mol.EntityView/ost.mol.EntityHandle) – Model with chains that are mapped onto chem_groups

ProcessStructure(ent)

Entity processing for chain mapping

  • Selects view containing peptide and nucleotide residues which have required backbone atoms present - for peptide residues thats N, CA, C and CB (no CB for GLY), for nucleotide residues thats O5’, C5’, C4’, C3’ and O3’.

  • filters view by chain lengths, see min_pep_length and min_nuc_length in constructor

  • Extracts atom sequences for each chain in that view

  • Attaches corresponding ost.mol.EntityView to each sequence

  • If residue number alignments are used, strictly increasing residue numbers without insertion codes are ensured in each chain

Parameters:

ent (ost.mol.EntityView/ost.mol.EntityHandle) – Entity to process

Returns:

Tuple with 3 elements: 1) ost.mol.EntityView containing peptide and nucleotide residues 2) ost.seq.SequenceList containing ATOMSEQ sequences for each polypeptide chain in returned view, sequences have ost.mol.EntityView of according chains attached 3) same for polynucleotide chains

Align(s1, s2, stype)

Access to internal sequence alignment functionality

Alignment parameterization is setup at ChainMapper construction

Parameters:
  • s1 (ost.seq.SequenceHandle) – First sequence to align - must have view attached in case of resnum_alignments

  • s2 (ost.seq.SequenceHandle) – Second sequence to align - must have view attached in case of resnum_alignments

  • stype – Type of sequences to align, must be in [ost.mol.ChemType.AMINOACIDS, ost.mol.ChemType.NUCLEOTIDES]

Returns:

Pairwise alignment of s1 and s2

class ReprResult(lDDT, substructure, ref_view, mdl_view)

Result object for ChainMapper.GetRepr()

Constructor is directly called within the function, no need to construct such objects yourself.

Parameters:
  • lDDT (float) – lDDT for this mapping. Depends on how you call ChainMapper.GetRepr() whether this is backbone only or full atom lDDT.

  • substructure (ost.mol.EntityView) – The full substructure for which we searched for a representation

  • ref_view (mol.EntityView) – View pointing to the same underlying entity as substructure but only contains the stuff that is mapped

  • mdl_view (mol.EntityView) – The matching counterpart in model

property lDDT

lDDT of representation result

Depends on how you call ChainMapper.GetRepr() whether this is backbone only or full atom lDDT.

Type:

float

property substructure

The full substructure for which we searched for a representation

Type:

ost.mol.EntityView

property ref_view

View which contains the mapped subset of substructure

Type:

ost.mol.EntityView

property mdl_view

The ref_view representation in the model

Type:

ost.mol.EntityView

property ref_residues

The reference residues

Type:

class:mol.ResidueViewList

property mdl_residues

The model residues

Type:

mol.ResidueViewList

property inconsistent_residues

A list of mapped residue whose names do not match (eg. ALA in the reference and LEU in the model).

The mismatches are reported as a tuple of ResidueView (reference, model), or as an empty list if all the residue names match.

Type:

list

property ref_bb_pos

Representative backbone positions for reference residues.

Thats CA positions for peptides and C3’ positions for Nucleotides.

Type:

geom.Vec3List

property mdl_bb_pos

Representative backbone positions for model residues.

Thats CA positions for peptides and C3’ positions for Nucleotides.

Type:

geom.Vec3List

property ref_full_bb_pos

Representative backbone positions for reference residues.

Thats N, CA and C positions for peptides and O5’, C5’, C4’, C3’, O3’ positions for Nucleotides.

Type:

geom.Vec3List

property mdl_full_bb_pos

Representative backbone positions for reference residues.

Thats N, CA and C positions for peptides and O5’, C5’, C4’, C3’, O3’ positions for Nucleotides.

Type:

geom.Vec3List

property transform

Transformation to superpose mdl residues onto ref residues

Superposition computed as minimal RMSD superposition on ref_bb_pos and mdl_bb_pos. If number of positions is smaller 3, the full_bb_pos equivalents are used instead.

Type:

ost.geom.Mat4

property superposed_mdl_bb_pos

mdl_bb_pos with transform applied

Type:

geom.Vec3List

property bb_rmsd

RMSD between ref_bb_pos and superposed_mdl_bb_pos

Type:

float

property gdt_8

GDT with one single threshold: 8.0

Type:

float

property gdt_4

GDT with one single threshold: 4.0

Type:

float

property gdt_2

GDT with one single threshold: 2.0

Type:

float

property gdt_1

GDT with one single threshold: 1.0

Type:

float

property ost_query

query for mdl residues in OpenStructure query language

Repr can be selected as full_mdl.Select(ost_query)

Returns invalid query if residue numbers have insertion codes.

Type:

str

JSONSummary()

Returns JSON serializable summary of results

GetFlatChainMapping(mdl_as_key=False)

Returns flat mapping of all chains in the representation

Parameters:

mdl_as_key – Default is target chain name as key and model chain name as value. This can be reversed with this flag.

Returns:

dict with str as key/value that describe one-to-one mapping

class MappingResult(target, model, chem_groups, chem_mapping, mapping, alns, opt_score=None)

Result object for the chain mapping functions in ChainMapper

Constructor is directly called within the functions, no need to construct such objects yourself.

property target

Target/reference structure, i.e. ChainMapper.target

Type:

ost.mol.EntityView

property model

Model structure that gets mapped onto target

Underwent same processing as ChainMapper.target, i.e. only contains peptide/nucleotide chains of sufficient size.

Type:

ost.mol.EntityView

property chem_groups

Groups of chemically equivalent chains in target

Same as ChainMapper.chem_group

list of list of str (chain names)

property chem_mapping

Assigns chains in model to chem_groups.

list of list of str (chain names)

property mapping

Mapping of model chains onto target

Exact same shape as chem_groups but containing the names of the mapped chains in model. May contain None for target chains that are not covered. No guarantee that all chains in model are mapped.

list of list of str (chain names)

property alns

Alignments of mapped chains in target and model

Each alignment is accessible with alns[(t_chain,m_chain)]. First sequence is the sequence of target chain, second sequence the one from model. The respective ost.mol.EntityView are attached with ost.seq.ConstSequenceHandle.AttachView().

Type:

dict with key: tuple of str, value: ost.seq.AlignmentHandle

property opt_score

Placeholder property without any guarantee of being set

Different scores get optimized in the various chain mapping algorithms. Some of them may set their final optimal score in that property. Consult the documentation of the respective chain mapping algorithm for more information. Won’t be in the return dict of JSONSummary().

GetFlatMapping(mdl_as_key=False)

Returns flat mapping as dict for all mapable chains

Parameters:

mdl_as_key – Default is target chain name as key and model chain name as value. This can be reversed with this flag.

Returns:

dict with str as key/value that describe one-to-one mapping

JSONSummary()

Returns JSON serializable summary of results

Search

Enter search terms or a module, class or function name.

Contents

Documentation is available for the following OpenStructure versions:

(Currently viewing dev) / 2.8 / 2.7 / 2.6 / 2.5 / 2.4 / 2.3.1 / 2.3 / 2.2 / 2.1 / 2.0 / 1.9 / 1.8 / 1.7.1 / 1.7 / 1.6 / 1.5 / 1.4 / 1.3 / 1.2 / 1.11 / 1.10 / 1.1

This documentation is still under heavy development!
If something is missing or if you need the C++ API description in doxygen style, check our old documentation for further information.