|
| | __init__ (self, target, resnum_alignments=False, pep_seqid_thr=95., nuc_seqid_thr=95., pep_subst_mat=seq.alg.BLOSUM62, pep_gap_open=-11, pep_gap_ext=-1, nuc_subst_mat=seq.alg.NUC44, nuc_gap_open=-4, nuc_gap_ext=-4, min_pep_length=6, min_nuc_length=4, n_max_naive=1e8, mdl_map_pep_seqid_thr=0., mdl_map_nuc_seqid_thr=0., seqres=None, trg_seqres_mapping=None) |
| |
| | target (self) |
| |
| | polypep_seqs (self) |
| |
| | polynuc_seqs (self) |
| |
| | chem_groups (self) |
| |
| | chem_group_alignments (self) |
| |
| | chem_group_ref_seqs (self) |
| |
| | chem_group_types (self) |
| |
| | GetChemMapping (self, model) |
| |
| | GetlDDTMapping (self, model, inclusion_radius=15.0, thresholds=[0.5, 1.0, 2.0, 4.0], strategy="heuristic", steep_opt_rate=None, block_seed_size=5, block_blocks_per_chem_group=5, chem_mapping_result=None, heuristic_n_max_naive=40320) |
| |
| | GetQSScoreMapping (self, model, contact_d=12.0, strategy="heuristic", block_seed_size=5, block_blocks_per_chem_group=5, heuristic_n_max_naive=40320, steep_opt_rate=None, chem_mapping_result=None, greedy_prune_contact_map=True) |
| |
| | GetRMSDMapping (self, model, strategy="heuristic", subsampling=50, chem_mapping_result=None, heuristic_n_max_naive=120) |
| |
| | GetAFMMapping (self, model, chem_mapping_result=None) |
| |
| | GetMapping (self, model, n_max_naive=40320) |
| |
| | GetRepr (self, substructure, model, topn=1, inclusion_radius=15.0, thresholds=[0.5, 1.0, 2.0, 4.0], bb_only=False, only_interchain=False, chem_mapping_result=None, global_mapping=None) |
| |
| | GetNMappings (self, model) |
| |
| | ProcessStructure (self, ent) |
| |
| | NWAlign (self, s1, s2, stype) |
| |
| | SEQRESAlign (self, seqres, s, s_ent) |
| |
| | ResNumAlign (self, s1, s2, s1_ent, s2_ent) |
| |
Class to compute chain mappings
All algorithms are performed on processed structures which fulfill
criteria as given in constructor arguments (*min_pep_length*,
"min_nuc_length") and only contain residues which have all required backbone
atoms. for peptide residues thats N, CA, C and CB (no CB for GLY), for
nucleotide residues thats O5', C5', C4', C3' and O3'.
Chain mapping is a three step process:
* Group chemically identical chains in *target* into so called chem
groups. There are two options to achieve this:
1) using pairwise alignments that are either computed with
Needleman-Wunsch (NW) or simply derived from residue numbers
(*resnum_alignments* flag). In case of NW, *pep_subst_mat*, *pep_gap_open*
and *pep_gap_ext* and their nucleotide equivalents are relevant. Two
chains are considered identical if they fulfill the *pep_seqid_thr* and
have at least *min_pep_length* aligned residues. Same logic is applied
for nucleotides with respective thresholds.
2) specify seqres sequences and provide the mapping manually. This can
be useful if you're loading data from an mmCIF file where you have mmCIF
entity information. Alignments are performed based on residue numbers
in this case and don't consider the *resnum_alignments* flag. Any
mismatch of one letter code in the structure and the respective SEQRES
raises an error.
* Map chains in an input model to these groups. Parameters for generating
alignments are the same as above. Model chains are mapped to the chem
group with highest sequence identity. Optionally, you can set a minimum
sequence identity threshold with *mdl_map_pep_seqid_thr*/
*mdl_map_nuc_seqid_thr* to avoid nonsensical mappings. You can either
get the group mapping with :func:`GetChemMapping` or directly call
one of the full fletched one-to-one chain mapping functions which
execute that step internally.
* Obtain one-to-one mapping for chains in an input model and
*target* with one of the available mapping functions. Just to get an
idea of complexity. If *target* and *model* are octamers, there are
``8! = 40320`` possible chain mappings.
:param target: Target structure onto which models are mapped.
Computations happen on a selection only containing
polypeptides and polynucleotides.
:type target: :class:`ost.mol.EntityView`/:class:`ost.mol.EntityHandle`
:param resnum_alignments: Use residue numbers instead of
Needleman-Wunsch to compute pairwise
alignments. Relevant for :attr:`~chem_groups`
if *seqres* is not specified explicitely.
:type resnum_alignments: :class:`bool`
:param pep_seqid_thr: Sequence identity threshold (in percent of aligned
columns) used to decide when two chains are
identical. 95 percent tolerates the few mutations
crystallographers like to do. Range: [0-100].
:type pep_seqid_thr: :class:`float`
:param nuc_seqid_thr: Nucleotide equivalent for *pep_seqid_thr*
:type nuc_seqid_thr: :class:`float`
:param pep_subst_mat: Substitution matrix to align peptide sequences,
irrelevant if *resnum_alignments* is True,
defaults to seq.alg.BLOSUM62
:type pep_subst_mat: :class:`ost.seq.alg.SubstWeightMatrix`
:param pep_gap_open: Gap open penalty to align peptide sequences,
irrelevant if *resnum_alignments* is True
:type pep_gap_open: :class:`int`
:param pep_gap_ext: Gap extension penalty to align peptide sequences,
irrelevant if *resnum_alignments* is True
:type pep_gap_ext: :class:`int`
:param nuc_subst_mat: Nucleotide equivalent for *pep_subst_mat*,
defaults to seq.alg.NUC44
:type nuc_subst_mat: :class:`ost.seq.alg.SubstWeightMatrix`
:param nuc_gap_open: Nucleotide equivalent for *pep_gap_open*
:type nuc_gap_open: :class:`int`
:param nuc_gap_ext: Nucleotide equivalent for *pep_gap_ext*
:type nuc_gap_ext: :class:`int`
:param min_pep_length: Minimal number of residues for a peptide chain to be
considered in target and in models.
:type min_pep_length: :class:`int`
:param min_nuc_length: Minimal number of residues for a nucleotide chain to be
considered in target and in models.
:type min_nuc_length: :class:`int`
:param n_max_naive: Max possible chain mappings that are enumerated in
:func:`~GetNaivelDDTMapping` /
:func:`~GetDecomposerlDDTMapping`. A
:class:`RuntimeError` is raised in case of bigger
complexity.
:type n_max_naive: :class:`int`
:param mdl_map_pep_seqid_thr: Sequence identity threshold (in percent of
aligned columns) to avoid non-sensical
mapping of model chains to reference chem
groups. If a value larger than 0.0 is
provided, minimal criteria for assignment
becomes a sequence identity of
*mdl_map_pep_seqid_thr* and at least
*min_pep_length* aligned residues. If set to
zero, it simply assigns a model chain to the
chem group with highest sequence identity.
Range: [0-100].
:type mdl_map_pep_seqid_thr: :class:`float`
:param mdl_map_nuc_seqid_thr: Nucleotide equivalent of
*mdl_map_pep_seqid_thr*
:type mdl_map_nuc_seqid_thr: :class:`float`
:param seqres: SEQRES sequences. All polymer chains in *target* will be
aligned to these sequences using a residue number based
alignment, effectively ignoring *resnum_alignments* in
the chem grouping step. Additionally, you need to
manually specify a mapping of the polymer chains using
*trg_seqres_mapping* and an error is raised otherwise.
The one letter codes in
the structure must exactly match the respective characters
in seqres and an error is raised if not. These seqres
sequences are set as :attr:`~chem_group_ref_seqs` and will
thus influence the mapping of model chains.
:type seqres: :class:`ost.seq.SequenceList`
:param trg_seqres_mapping: Maps each polymer chain in *target* to a sequence
in *seqres*. It's a :class:`dict` with chain name
as key and seqres sequence name as value.
:type trg_seqres_mapping: :class:`dict`
Definition at line 480 of file chain_mapping.py.
| _MapSequence |
( |
|
self, |
|
|
|
ref_seqs, |
|
|
|
ref_types, |
|
|
|
s, |
|
|
|
s_type, |
|
|
|
s_ent, |
|
|
|
seq_id_thr = 0.0, |
|
|
|
min_aln_length = 0 |
|
) |
| |
|
protected |
Tries top map *s* onto any of the sequences in *ref_seqs*
Computes alignments of *s* to each of the reference sequences of equal type
and sorts them by seqid*fraction_covered (seqid: sequence identity of
aligned columns in alignment, fraction_covered: Fraction of non-gap
characters in reference sequence that are covered by non-gap characters in
*s*). Best scoring mapping is returned. Optionally, *seq_id*/
*min_aln_length* thresholds can be enabled to avoid non-sensical mappings.
However, *min_aln_length* only has an effect if *seq_id_thr* > 0!!!
:param ref_seqs: Reference sequences
:type ref_seqs: :class:`ost.seq.SequenceList`
:param ref_types: Types of reference sequences, e.g.
ost.mol.ChemType.AminoAcids
:type ref_types: :class:`list` of :class:`ost.mol.ChemType`
:param s: Sequence to map
:type s: :class:`ost.seq.SequenceHandle`
:param s_type: Type of *s*, only try mapping to sequences in *ref_seqs*
with equal type as defined in *ref_types*
:param s_ent: Entity which represents *s*. Only relevant in case of
residue number alignments.
:type s_ent: :class:`ost.mol.EntityHandle`/:class:`ost.mol.EntityView`
:param seq_id_thr: Minimum sequence identity to be considered as match
:type seq_id_thr: :class:`float`
:param min_aln_length: Minimum number of aligned columns to be considered
as match. Only has an effect if *seq_id_thr* > 0!
:type min_aln_length: :class:`int`
:returns: Tuple with two elements. 1) index of sequence in *ref_seqs* to
which *s* can be mapped 2) Pairwise sequence alignment with
sequence from *ref_seqs* as first sequence. Both elements are
None if no mapping can be found or if thresholds are not
fulfilled for any alignment.
:raises: :class:`RuntimeError` if mapping is ambiguous, i.e. *s*
successfully maps to more than one sequence in *ref_seqs*
Definition at line 2221 of file chain_mapping.py.
| GetAFMMapping |
( |
|
self, |
|
|
|
model, |
|
|
|
chem_mapping_result = None |
|
) |
| |
Identify chain mapping as in AlphaFold-Multimer (AFM)
Described in section 7.3 of
Richard Evans, Michael O’Neill, Alexander Pritzel, Natasha Antropova,
Andrew Senior, Tim Green, Augustin Žídek, Russ Bates, Sam Blackwell,
Jason Yim, et al. Protein complex prediction with AlphaFold-Multimer.
bioRxiv preprint bioRxiv:10.1101/2021.10.04.463034, 2021.
To summarize (largely follows the description in the AF-Multimer paper):
One anchor chain in the ground truth (reference) is selected. If the
reference has stoichiometry A3B2, an arbitary chain in B is selected
as it has smaller ambiguity. In a tie, for example A2B2, the longest
chain in A, B is selected.
Given a model chain with same sequence as the reference anchor chain,
a CA-RMSD (C3' for nucleotides) based superposition is performed. Chains
are then greedily assigned by minimum distance of their geometric
centers. This procedure is repeated for every model chain with same
sequence and the assignment with minimal RMSD of the geometric centers
is returned.
A modification of this algorithm allows to deal with different
stoichiometries in reference and model. In the anchor selection, this
function ensures that there is at least one model chain that can be
mapped to this anchor. If the logic above does not identify a mappable
chain pair for the smallest group, it continues with the second smallest
group etc.
:param model: Model to map
:type model: :class:`ost.mol.EntityView`/:class:`ost.mol.EntityHandle`
:param chem_mapping_result: Pro param. The result of
:func:`~GetChemMapping` where you provided
*model*. If set, *model* parameter is not
used.
:type chem_mapping_result: :class:`tuple`
:returns: A :class:`MappingResult`
Definition at line 1253 of file chain_mapping.py.
| GetlDDTMapping |
( |
|
self, |
|
|
|
model, |
|
|
|
inclusion_radius = 15.0, |
|
|
|
thresholds = [0.5, 1.0, 2.0, 4.0], |
|
|
|
strategy = "heuristic", |
|
|
|
steep_opt_rate = None, |
|
|
|
block_seed_size = 5, |
|
|
|
block_blocks_per_chem_group = 5, |
|
|
|
chem_mapping_result = None, |
|
|
|
heuristic_n_max_naive = 40320 |
|
) |
| |
Identify chain mapping by optimizing lDDT score
Maps *model* chain sequences to :attr:`~chem_groups` and find mapping
based on backbone only lDDT score (CA for amino acids C3' for
Nucleotides).
Either performs a naive search, i.e. enumerate all possible mappings or
executes a greedy strategy that tries to identify a (close to) optimal
mapping in an iterative way by starting from a start mapping (seed). In
each iteration, the one-to-one mapping that leads to highest increase
in number of conserved contacts is added with the additional requirement
that this added mapping must have non-zero interface counts towards the
already mapped chains. So basically we're "growing" the mapped structure
by only adding connected stuff.
The available strategies:
* **naive**: Enumerates all possible mappings and returns best
* **greedy_full**: try multiple seeds for greedy extension, i.e. try
all ref/mdl chain combinations within the respective chem groups and
retain the mapping leading to the best lDDT.
* **greedy_block**: try multiple seeds for greedy extension, i.e. try
all ref/mdl chain combinations within the respective chem groups and
extend them to *block_seed_size*. *block_blocks_per_chem_group*
for each chem group are selected for exhaustive extension.
* **heuristic**: Uses *naive* strategy if number of possible mappings
is within *heuristic_n_max_naive*. The default of 40320 corresponds
to an octamer (8!=40320). A structure with stoichiometry A6B2 would be
6!*2!=1440 etc. If the number of possible mappings is larger,
*greedy_full* is used.
Sets :attr:`MappingResult.opt_score` in case of no trivial one-to-one
mapping.
:param model: Model to map
:type model: :class:`ost.mol.EntityView`/:class:`ost.mol.EntityHandle`
:param inclusion_radius: Inclusion radius for lDDT
:type inclusion_radius: :class:`float`
:param thresholds: Thresholds for lDDT
:type thresholds: :class:`list` of :class:`float`
:param strategy: Strategy to find mapping. Must be in ["naive",
"greedy_full", "greedy_block"]
:type strategy: :class:`str`
:param steep_opt_rate: Only relevant for greedy strategies.
If set, every *steep_opt_rate* mappings, a simple
optimization is executed with the goal of
avoiding local minima. The optimization
iteratively checks all possible swaps of mappings
within their respective chem groups and accepts
swaps that improve lDDT score. Iteration stops as
soon as no improvement can be achieved anymore.
:type steep_opt_rate: :class:`int`
:param block_seed_size: Param for *greedy_block* strategy - Initial seeds
are extended by that number of chains.
:type block_seed_size: :class:`int`
:param block_blocks_per_chem_group: Param for *greedy_block* strategy -
Number of blocks per chem group that
are extended in an initial search
for high scoring local solutions.
:type block_blocks_per_chem_group: :class:`int`
:param chem_mapping_result: Pro param. The result of
:func:`~GetChemMapping` where you provided
*model*. If set, *model* parameter is not
used.
:type chem_mapping_result: :class:`tuple`
:returns: A :class:`MappingResult`
Definition at line 817 of file chain_mapping.py.
| GetQSScoreMapping |
( |
|
self, |
|
|
|
model, |
|
|
|
contact_d = 12.0, |
|
|
|
strategy = "heuristic", |
|
|
|
block_seed_size = 5, |
|
|
|
block_blocks_per_chem_group = 5, |
|
|
|
heuristic_n_max_naive = 40320, |
|
|
|
steep_opt_rate = None, |
|
|
|
chem_mapping_result = None, |
|
|
|
greedy_prune_contact_map = True |
|
) |
| |
Identify chain mapping based on QSScore
Scoring is based on CA/C3' positions which are present in all chains of
a :attr:`chem_groups` as well as the *model* chains which are mapped to
that respective chem group.
The following strategies are available:
* **naive**: Naively iterate all possible mappings and return best based
on QS score.
* **greedy_full**: try multiple seeds for greedy extension, i.e. try
all ref/mdl chain combinations within the respective chem groups and
retain the mapping leading to the best QS-score.
* **greedy_block**: try multiple seeds for greedy extension, i.e. try
all ref/mdl chain combinations within the respective chem groups and
extend them to *block_seed_size*. *block_blocks_per_chem_group*
for each chem group are selected for exhaustive extension.
* **heuristic**: Uses *naive* strategy if number of possible mappings
is within *heuristic_n_max_naive*. The default of 40320 corresponds
to an octamer (8!=40320). A structure with stoichiometry A6B2 would be
6!*2!=1440 etc. If the number of possible mappings is larger,
*greedy_full* is used.
Sets :attr:`MappingResult.opt_score` in case of no trivial one-to-one
mapping.
:param model: Model to map
:type model: :class:`ost.mol.EntityView`/:class:`ost.mol.EntityHandle`
:param contact_d: Max distance between two residues to be considered as
contact in qs scoring
:type contact_d: :class:`float`
:param strategy: Strategy for sampling, must be in ["naive",
greedy_full", "greedy_block"]
:type strategy: :class:`str`
:param chem_mapping_result: Pro param. The result of
:func:`~GetChemMapping` where you provided
*model*. If set, *model* parameter is not
used.
:type chem_mapping_result: :class:`tuple`
:param greedy_prune_contact_map: Relevant for all strategies that use
greedy extensions. If True, only chains
with at least 3 contacts (8A CB
distance) towards already mapped chains
in trg/mdl are considered for
extension. All chains that give a
potential non-zero QS-score increase
are used otherwise (at least one
contact within 12A). The consequence
is reduced runtime and usually no
real reduction in accuracy.
:returns: A :class:`MappingResult`
Definition at line 964 of file chain_mapping.py.
| GetRMSDMapping |
( |
|
self, |
|
|
|
model, |
|
|
|
strategy = "heuristic", |
|
|
|
subsampling = 50, |
|
|
|
chem_mapping_result = None, |
|
|
|
heuristic_n_max_naive = 120 |
|
) |
| |
Identify chain mapping based on minimal RMSD superposition
Superposition and scoring is based on CA/C3' positions which are present
in all chains of a :attr:`chem_groups` as well as the *model*
chains which are mapped to that respective chem group.
The following strategies are available:
* **naive**: Naively iterate all possible mappings and return the one
with lowes RMSD.
* **greedy_single**: perform all vs. all single chain superpositions
within the respective ref/mdl chem groups to use as starting points.
For each starting point, iteratively add the model/target chain pair
with lowest RMSD until a full mapping is achieved. The mapping with
lowest RMSD is returned.
* **greedy_single_centroid**: Same as *greedy_single* but iteratively
adds model/target chain pairs with lowest centroid distance. The
mapping with the lowest centroid RMSD is returned. This is mainly for
evaluation purposes. One known failure mode is that intertwined
structures may have multiple centroids sitting on top of each other.
RMSD and thus *greedy_single* are better able to disentangle this
situation.
* **greedy_iterative**: Same as greedy_single_rmsd exept that the
transformation gets updated with each added chain pair.
* **heuristic**: Uses *naive* strategy if number of possible mappings
is within *heuristic_n_max_naive*. The default of 120 corresponds
to a homo-pentamer (5!=120). If the number of possible mappings is
larger, *greedy_iterative* is used.
:param model: Model to map
:type model: :class:`ost.mol.EntityView`/:class:`ost.mol.EntityHandle`
:param strategy: Strategy for sampling. Must be in ["naive",
"greedy_single", "greedy_iterative"]
:type strategy: :class:`str`
:param subsampling: If given, only an equally distributed subset
of CA/C3' positions in each chain are used for
superposition/scoring.
:type subsampling: :class:`int`
:param chem_mapping_result: Pro param. The result of
:func:`~GetChemMapping` where you provided
*model*. If set, *model* parameter is not
used.
:type chem_mapping_result: :class:`tuple`
:returns: A :class:`MappingResult`
Definition at line 1093 of file chain_mapping.py.
| ProcessStructure |
( |
|
self, |
|
|
|
ent |
|
) |
| |
Entity processing for chain mapping
* Selects view containing peptide and nucleotide residues which have
required backbone atoms present - for peptide residues thats
N, CA, C and CB (no CB for GLY), for nucleotide residues thats
O5', C5', C4', C3' and O3'.
* filters view by chain lengths, see *min_pep_length* and
*min_nuc_length* in constructor
* Extracts atom sequences for each chain in that view
* If residue number alignments are used, strictly increasing residue
numbers without insertion codes are ensured in each chain
:param ent: Entity to process
:type ent: :class:`ost.mol.EntityView`/:class:`ost.mol.EntityHandle`
:returns: Tuple with 3 elements: 1) :class:`ost.mol.EntityView`
containing peptide and nucleotide residues 2)
:class:`ost.seq.SequenceList` containing ATOMSEQ sequences
for each polypeptide chain in returned view
3) same for polynucleotide chains
Definition at line 1765 of file chain_mapping.py.