chain_mapping
– Chain Mapping¶
Chain mapping aims to identify a one-to-one relationship between chains in a reference structure and a model.
- class ChainMapper(target, resnum_alignments=False, pep_seqid_thr=95.0, pep_gap_thr=1.0, nuc_seqid_thr=95.0, nuc_gap_thr=1.0, pep_subst_mat=<ost.seq.alg._ost_seq_alg.SubstWeightMatrix object>, pep_gap_open=-11, pep_gap_ext=-1, nuc_subst_mat=<ost.seq.alg._ost_seq_alg.SubstWeightMatrix object>, nuc_gap_open=-4, nuc_gap_ext=-4, min_pep_length=6, min_nuc_length=4, n_max_naive=100000000.0)¶
Class to compute chain mappings
All algorithms are performed on processed structures which fulfill criteria as given in constructor arguments (min_pep_length, “min_nuc_length”) and only contain residues which have all required backbone atoms. for peptide residues thats N, CA, C and CB (no CB for GLY), for nucleotide residues thats O5’, C5’, C4’, C3’ and O3’.
Chain mapping is a three step process:
Group chemically identical chains in target using pairwise alignments that are either computed with Needleman-Wunsch (NW) or simply derived from residue numbers (resnum_alignments flag). In case of NW, pep_subst_mat, pep_gap_open and pep_gap_ext and their nucleotide equivalents are relevant. Two chains are considered identical if they fulfill the thresholds given by pep_seqid_thr, pep_gap_thr, their nucleotide equivalents respectively. The grouping information is available as attributes of this class.
Map chains in an input model to these groups. Generating alignments and the similarity criteria are the same as above. You can either get the group mapping with
GetChemMapping()
or directly call one of the full fletched one-to-one chain mapping functions which execute that step internally.Obtain one-to-one mapping for chains in an input model and target with one of the available mapping functions. Just to get an idea of complexity. If target and model are octamers, there are
8! = 40320
possible chain mappings.
- Parameters:
target (
ost.mol.EntityView
/ost.mol.EntityHandle
) – Target structure onto which models are mapped. Computations happen on a selection only containing polypeptides and polynucleotides.resnum_alignments (
bool
) – Use residue numbers instead of Needleman-Wunsch to compute pairwise alignments. Relevant forchem_groups
and related attributes.pep_seqid_thr (
float
) – Threshold used to decide when two chains are identical. 95 percent tolerates the few mutations crystallographers like to do.pep_gap_thr (
float
) – Additional threshold to avoid gappy alignments with high seqid. By default this is disabled (set to 1.0). This threshold checks for a maximum allowed fraction of gaps in any of the two sequences after stripping terminal gaps. The reason for not just normalizing seqid by the longer sequence is that one sequence might be a perfect subsequence of the other but only cover half of it.nuc_seqid_thr (
float
) – Nucleotide equivalent for pep_seqid_thrnuc_gap_thr (
float
) – Nucleotide equivalent for nuc_gap_thrpep_subst_mat (
ost.seq.alg.SubstWeightMatrix
) – Substitution matrix to align peptide sequences, irrelevant if resnum_alignments is True, defaults to seq.alg.BLOSUM62pep_gap_open (
int
) – Gap open penalty to align peptide sequences, irrelevant if resnum_alignments is Truepep_gap_ext (
int
) – Gap extension penalty to align peptide sequences, irrelevant if resnum_alignments is Truenuc_subst_mat (
ost.seq.alg.SubstWeightMatrix
) – Nucleotide equivalent for pep_subst_mat, defaults to seq.alg.NUC44nuc_gap_open (
int
) – Nucleotide equivalent for pep_gap_opennuc_gap_ext (
int
) – Nucleotide equivalent for pep_gap_extmin_pep_length (
int
) – Minimal number of residues for a peptide chain to be considered in target and in models.min_nuc_length (
int
) – Minimal number of residues for a nucleotide chain to be considered in target and in models.n_max_naive (
int
) – Max possible chain mappings that are enumerated inGetNaivelDDTMapping()
/GetDecomposerlDDTMapping()
. ARuntimeError
is raised in case of bigger complexity.
- property target¶
Target structure that only contains peptides/nucleotides
Contains only residues that have the backbone representatives (CA for peptide and C3’ for nucleotides) to avoid ATOMSEQ alignment inconsistencies when switching between all atom and backbone only representations.
- Type:
- property polypep_seqs¶
Sequences of peptide chains in
target
Respective
EntityView
from target for each sequence s are available ass.GetAttachedView()
- Type:
- property polynuc_seqs¶
Sequences of nucleotide chains in
target
Respective
EntityView
from target for each sequence s are available ass.GetAttachedView()
- Type:
- property chem_groups¶
Groups of chemically equivalent chains in
target
First chain in group is the one with longest sequence.
- Getter:
Computed on first use (cached)
- Type:
list
oflist
ofstr
(chain names)
- property chem_group_alignments¶
MSA for each group in
chem_groups
Sequences in MSAs exhibit same order as in
chem_groups
and have the respectiveost.mol.EntityView
from target attached.- Getter:
Computed on first use (cached)
- Type:
ost.seq.AlignmentList
- property chem_group_ref_seqs¶
Reference (longest) sequence for each group in
chem_groups
Respective
EntityView
from target for each sequence s are available ass.GetAttachedView()
- Getter:
Computed on first use (cached)
- Type:
- property chem_group_types¶
ChemType of each group in
chem_groups
Specifying if groups are poly-peptides/nucleotides, i.e.
ost.mol.ChemType.AMINOACIDS
orost.mol.ChemType.NUCLEOTIDES
- Getter:
Computed on first use (cached)
- Type:
list
ofost.mol.ChemType
- GetChemMapping(model)¶
Maps sequences in model to chem_groups of target
- Parameters:
model (
ost.mol.EntityView
/ost.mol.EntityHandle
) – Model from which to extract sequences, a selection that only includes peptides and nucleotides is performed and returned along other results.- Returns:
Tuple with two lists of length len(self.chem_groups) and an
ost.mol.EntityView
representing model: 1) Each element is alist
with mdl chain names that map to the chem group at that position. 2) Each element is aost.seq.AlignmentList
aligning these mdl chain sequences to the chem group ref sequences. 3) A selection of model that only contains polypeptides and polynucleotides whose ATOMSEQ exactly matches the sequence info in the returned alignments.
- GetlDDTMapping(model, inclusion_radius=15.0, thresholds=[0.5, 1.0, 2.0, 4.0], strategy='heuristic', steep_opt_rate=None, block_seed_size=5, block_blocks_per_chem_group=5, chem_mapping_result=None, heuristic_n_max_naive=40320)¶
Identify chain mapping by optimizing lDDT score
Maps model chain sequences to
chem_groups
and find mapping based on backbone only lDDT score (CA for amino acids C3’ for Nucleotides).Either performs a naive search, i.e. enumerate all possible mappings or executes a greedy strategy that tries to identify a (close to) optimal mapping in an iterative way by starting from a start mapping (seed). In each iteration, the one-to-one mapping that leads to highest increase in number of conserved contacts is added with the additional requirement that this added mapping must have non-zero interface counts towards the already mapped chains. So basically we’re “growing” the mapped structure by only adding connected stuff.
The available strategies:
naive: Enumerates all possible mappings and returns best
greedy_fast: perform all vs. all single chain lDDTs within the respective ref/mdl chem groups. The mapping with highest number of conserved contacts is selected as seed for greedy extension
greedy_full: try multiple seeds for greedy extension, i.e. try all ref/mdl chain combinations within the respective chem groups and retain the mapping leading to the best lDDT.
greedy_block: try multiple seeds for greedy extension, i.e. try all ref/mdl chain combinations within the respective chem groups and extend them to block_seed_size. block_blocks_per_chem_group for each chem group are selected for exhaustive extension.
heuristic: Uses naive strategy if number of possible mappings is within heuristic_n_max_naive. The default of 40320 corresponds to an octamer (8!=40320). A structure with stoichiometry A6B2 would be 6!*2!=1440 etc. If the number of possible mappings is larger, greedy_full is used.
Sets
MappingResult.opt_score
in case of no trivial one-to-one mapping.- Parameters:
model (
ost.mol.EntityView
/ost.mol.EntityHandle
) – Model to mapinclusion_radius (
float
) – Inclusion radius for lDDTthresholds (
list
offloat
) – Thresholds for lDDTstrategy (
str
) – Strategy to find mapping. Must be in [“naive”, “greedy_fast”, “greedy_full”, “greedy_block”]steep_opt_rate (
int
) – Only relevant for greedy strategies. If set, every steep_opt_rate mappings, a simple optimization is executed with the goal of avoiding local minima. The optimization iteratively checks all possible swaps of mappings within their respective chem groups and accepts swaps that improve lDDT score. Iteration stops as soon as no improvement can be achieved anymore.block_seed_size (
int
) – Param for greedy_block strategy - Initial seeds are extended by that number of chains.block_blocks_per_chem_group (
int
) – Param for greedy_block strategy - Number of blocks per chem group that are extended in an initial search for high scoring local solutions.chem_mapping_result (
tuple
) – Pro param. The result ofGetChemMapping()
where you provided model. If set, model parameter is not used.
- Returns:
- GetQSScoreMapping(model, contact_d=12.0, strategy='heuristic', block_seed_size=5, block_blocks_per_chem_group=5, heuristic_n_max_naive=40320, steep_opt_rate=None, chem_mapping_result=None, greedy_prune_contact_map=True)¶
Identify chain mapping based on QSScore
Scoring is based on CA/C3’ positions which are present in all chains of a
chem_groups
as well as the model chains which are mapped to that respective chem group.The following strategies are available:
- naive: Naively iterate all possible mappings and return best based
on QS score.
greedy_fast: perform all vs. all single chain lDDTs within the respective ref/mdl chem groups. The mapping with highest number of conserved contacts is selected as seed for greedy extension. Extension is based on QS-score.
greedy_full: try multiple seeds for greedy extension, i.e. try all ref/mdl chain combinations within the respective chem groups and retain the mapping leading to the best QS-score.
greedy_block: try multiple seeds for greedy extension, i.e. try all ref/mdl chain combinations within the respective chem groups and extend them to block_seed_size. block_blocks_per_chem_group for each chem group are selected for exhaustive extension.
heuristic: Uses naive strategy if number of possible mappings is within heuristic_n_max_naive. The default of 40320 corresponds to an octamer (8!=40320). A structure with stoichiometry A6B2 would be 6!*2!=1440 etc. If the number of possible mappings is larger, greedy_full is used.
Sets
MappingResult.opt_score
in case of no trivial one-to-one mapping.- Parameters:
model (
ost.mol.EntityView
/ost.mol.EntityHandle
) – Model to mapcontact_d (
float
) – Max distance between two residues to be considered as contact in qs scoringstrategy (
str
) – Strategy for sampling, must be in [“naive”, “greedy_fast”, “greedy_full”, “greedy_block”]chem_mapping_result (
tuple
) – Pro param. The result ofGetChemMapping()
where you provided model. If set, model parameter is not used.greedy_prune_contact_map – Relevant for all strategies that use greedy extensions. If True, only chains with at least 3 contacts (8A CB distance) towards already mapped chains in trg/mdl are considered for extension. All chains that give a potential non-zero QS-score increase are used otherwise (at least one contact within 12A). The consequence is reduced runtime and usually no real reduction in accuracy.
- Returns:
- GetRMSDMapping(model, strategy='heuristic', subsampling=50, chem_mapping_result=None, heuristic_n_max_naive=120)¶
Identify chain mapping based on minimal RMSD superposition
Superposition and scoring is based on CA/C3’ positions which are present in all chains of a
chem_groups
as well as the model chains which are mapped to that respective chem group.The following strategies are available:
naive: Naively iterate all possible mappings and return the one with lowes RMSD.
greedy_single: perform all vs. all single chain superpositions within the respective ref/mdl chem groups to use as starting points. For each starting point, iteratively add the model/target chain pair with lowest RMSD until a full mapping is achieved. The mapping with lowest RMSD is returned.
greedy_iterative: Same as greedy_single_rmsd exept that the transformation gets updated with each added chain pair.
heuristic: Uses naive strategy if number of possible mappings is within heuristic_n_max_naive. The default of 120 corresponds to a homo-pentamer (5!=120). If the number of possible mappings is larger, greedy_iterative is used.
- Parameters:
model (
ost.mol.EntityView
/ost.mol.EntityHandle
) – Model to mapstrategy (
str
) – Strategy for sampling. Must be in [“naive”, “greedy_single”, “greedy_iterative”]subsampling (
int
) – If given, only an equally distributed subset of CA/C3’ positions in each chain are used for superposition/scoring.chem_mapping_result (
tuple
) – Pro param. The result ofGetChemMapping()
where you provided model. If set, model parameter is not used.
- Returns:
- GetMapping(model, n_max_naive=40320)¶
Convenience function to get mapping with currently preferred method
If number of possible chain mappings is <= n_max_naive, a naive QS-score mapping is performed and optimal QS-score is guaranteed. For anything else, a QS-score mapping with the greedy_full strategy is performed (greedy_prune_contact_map = True). The default for n_max_naive of 40320 corresponds to an octamer (8!=40320). A structure with stoichiometry A6B2 would be 6!*2!=1440 etc.
If
target
has nucleotide sequences, the QS-score target function is replaced with a backbone only lDDT score that has an inclusion radius of 30A.
- GetRepr(substructure, model, topn=1, inclusion_radius=15.0, thresholds=[0.5, 1.0, 2.0, 4.0], bb_only=False, only_interchain=False, chem_mapping_result=None, global_mapping=None)¶
Identify topn representations of substructure in model
substructure defines a subset of
target
for which one wants the topn representations in model. Representations are scored and sorted by lDDT.- Parameters:
substructure (
ost.mol.EntityView
) – Aost.mol.EntityView
which is a subset oftarget
. Should be selected with the OpenStructure query language. Example: if you’re interested in residues with number 42,43 and 85 in chain A:substructure=mapper.target.Select("cname=A and rnum=42,43,85")
ARuntimeError
is raised if substructure does not refer to the same underlyingost.mol.EntityHandle
astarget
.model (
ost.mol.EntityView
/ost.mol.EntityHandle
) – Structure in which one wants to find representations for substructuretopn (
int
) – Max number of representations that are returnedinclusion_radius (
float
) – Inclusion radius for lDDTthresholds (
list
offloat
) – Thresholds for lDDTbb_only (
bool
) – Only consider backbone atoms in lDDT computationonly_interchain (
bool
) – Only score interchain contacts in lDDT. Useful if you want to identify interface patches.chem_mapping_result (
tuple
) – Pro param. The result ofGetChemMapping()
where you provided model. If set, model parameter is not used.global_mapping (
MappingResult
) – Pro param. Specify a global mapping result. This fully defines the desired representation in the model but extracts it and enriches it with all the nice attributes ofReprResult
. The target attribute in global_mapping must be of the same entity as self.target and the model attribute of global_mapping must be of the same entity as model.
- Returns:
list
ofReprResult
- GetNMappings(model)¶
Returns number of possible mappings
- Parameters:
model (
ost.mol.EntityView
/ost.mol.EntityHandle
) – Model with chains that are mapped ontochem_groups
- ProcessStructure(ent)¶
Entity processing for chain mapping
Selects view containing peptide and nucleotide residues which have required backbone atoms present - for peptide residues thats N, CA, C and CB (no CB for GLY), for nucleotide residues thats O5’, C5’, C4’, C3’ and O3’.
filters view by chain lengths, see min_pep_length and min_nuc_length in constructor
Extracts atom sequences for each chain in that view
Attaches corresponding
ost.mol.EntityView
to each sequenceIf residue number alignments are used, strictly increasing residue numbers without insertion codes are ensured in each chain
- Parameters:
ent (
ost.mol.EntityView
/ost.mol.EntityHandle
) – Entity to process- Returns:
Tuple with 3 elements: 1)
ost.mol.EntityView
containing peptide and nucleotide residues 2)ost.seq.SequenceList
containing ATOMSEQ sequences for each polypeptide chain in returned view, sequences haveost.mol.EntityView
of according chains attached 3) same for polynucleotide chains
- Align(s1, s2, stype)¶
Access to internal sequence alignment functionality
Alignment parameterization is setup at ChainMapper construction
- Parameters:
s1 (
ost.seq.SequenceHandle
) – First sequence to align - must have view attached in case of resnum_alignmentss2 (
ost.seq.SequenceHandle
) – Second sequence to align - must have view attached in case of resnum_alignmentsstype – Type of sequences to align, must be in [
ost.mol.ChemType.AMINOACIDS
,ost.mol.ChemType.NUCLEOTIDES
]
- Returns:
Pairwise alignment of s1 and s2
- class ReprResult(lDDT, substructure, ref_view, mdl_view)¶
Result object for
ChainMapper.GetRepr()
Constructor is directly called within the function, no need to construct such objects yourself.
- Parameters:
lDDT (
float
) – lDDT for this mapping. Depends on how you callChainMapper.GetRepr()
whether this is backbone only or full atom lDDT.substructure (
ost.mol.EntityView
) – The full substructure for which we searched for a representationref_view (
mol.EntityView
) – View pointing to the same underlying entity as substructure but only contains the stuff that is mappedmdl_view (
mol.EntityView
) – The matching counterpart in model
- property lDDT¶
lDDT of representation result
Depends on how you call
ChainMapper.GetRepr()
whether this is backbone only or full atom lDDT.- Type:
float
- property substructure¶
The full substructure for which we searched for a representation
- Type:
- property ref_view¶
View which contains the mapped subset of
substructure
- Type:
- property ref_residues¶
The reference residues
- Type:
class:mol.ResidueViewList
- property mdl_residues¶
The model residues
- Type:
mol.ResidueViewList
- property inconsistent_residues¶
A list of mapped residue whose names do not match (eg. ALA in the reference and LEU in the model).
The mismatches are reported as a tuple of
ResidueView
(reference, model), or as an empty list if all the residue names match.- Type:
list
- property ref_bb_pos¶
Representative backbone positions for reference residues.
Thats CA positions for peptides and C3’ positions for Nucleotides.
- Type:
geom.Vec3List
- property mdl_bb_pos¶
Representative backbone positions for model residues.
Thats CA positions for peptides and C3’ positions for Nucleotides.
- Type:
geom.Vec3List
- property ref_full_bb_pos¶
Representative backbone positions for reference residues.
Thats N, CA and C positions for peptides and O5’, C5’, C4’, C3’, O3’ positions for Nucleotides.
- Type:
geom.Vec3List
- property mdl_full_bb_pos¶
Representative backbone positions for reference residues.
Thats N, CA and C positions for peptides and O5’, C5’, C4’, C3’, O3’ positions for Nucleotides.
- Type:
geom.Vec3List
- property transform¶
Transformation to superpose mdl residues onto ref residues
Superposition computed as minimal RMSD superposition on
ref_bb_pos
andmdl_bb_pos
. If number of positions is smaller 3, the full_bb_pos equivalents are used instead.- Type:
- property superposed_mdl_bb_pos¶
mdl_bb_pos
withtransform applied
- Type:
geom.Vec3List
- property bb_rmsd¶
RMSD between
ref_bb_pos
andsuperposed_mdl_bb_pos
- Type:
float
- property gdt_8¶
GDT with one single threshold: 8.0
- Type:
float
- property gdt_4¶
GDT with one single threshold: 4.0
- Type:
float
- property gdt_2¶
GDT with one single threshold: 2.0
- Type:
float
- property gdt_1¶
GDT with one single threshold: 1.0
- Type:
float
- property ost_query¶
query for mdl residues in OpenStructure query language
Repr can be selected as
full_mdl.Select(ost_query)
Returns invalid query if residue numbers have insertion codes.
- Type:
str
- JSONSummary()¶
Returns JSON serializable summary of results
- GetFlatChainMapping(mdl_as_key=False)¶
Returns flat mapping of all chains in the representation
- Parameters:
mdl_as_key – Default is target chain name as key and model chain name as value. This can be reversed with this flag.
- Returns:
dict
withstr
as key/value that describe one-to-one mapping
- class MappingResult(target, model, chem_groups, chem_mapping, mapping, alns, opt_score=None)¶
Result object for the chain mapping functions in
ChainMapper
Constructor is directly called within the functions, no need to construct such objects yourself.
- property target¶
Target/reference structure, i.e.
ChainMapper.target
- Type:
- property model¶
Model structure that gets mapped onto
target
Underwent same processing as
ChainMapper.target
, i.e. only contains peptide/nucleotide chains of sufficient size.- Type:
- property chem_groups¶
Groups of chemically equivalent chains in
target
Same as
ChainMapper.chem_group
list
oflist
ofstr
(chain names)
- property chem_mapping¶
Assigns chains in
model
tochem_groups
.list
oflist
ofstr
(chain names)
- property mapping¶
Mapping of
model
chains ontotarget
Exact same shape as
chem_groups
but containing the names of the mapped chains inmodel
. May contain None fortarget
chains that are not covered. No guarantee that all chains inmodel
are mapped.list
oflist
ofstr
(chain names)
- property alns¶
Alignments of mapped chains in
target
andmodel
Each alignment is accessible with
alns[(t_chain,m_chain)]
. First sequence is the sequence oftarget
chain, second sequence the one frommodel
. The respectiveost.mol.EntityView
are attached withost.seq.ConstSequenceHandle.AttachView()
.- Type:
dict
with key:tuple
ofstr
, value:ost.seq.AlignmentHandle
- property opt_score¶
Placeholder property without any guarantee of being set
Different scores get optimized in the various chain mapping algorithms. Some of them may set their final optimal score in that property. Consult the documentation of the respective chain mapping algorithm for more information. Won’t be in the return dict of
JSONSummary()
.
- GetFlatMapping(mdl_as_key=False)¶
Returns flat mapping as
dict
for all mapable chains- Parameters:
mdl_as_key – Default is target chain name as key and model chain name as value. This can be reversed with this flag.
- Returns:
dict
withstr
as key/value that describe one-to-one mapping
- JSONSummary()¶
Returns JSON serializable summary of results