Modelling Algorithms¶
A collection of algorithms that can be useful in modelling
Rigid Blocks¶
RMSD is a typical measure for the similarity of two structures. Given an atom-atom mapping between two structures, the minimum RMSD and the corresponding superposition can be calculated efficiently using an approach based on singular value decomposition. This approach is problematic if there are very dissimilar regions or when domain movement events occur. We therefore implement an iterative superposition: the two structures undergo an initial superposition, and in every iteration we select the subset of atoms within a certain distance threshold to serve as input for the next superposition. This iterative superposition typically converges to the largest common subpart but is non-deterministic, since it depends on the initial superposition. The RigidBlocks algorithm is based only on the CA positions and performs this iterative superposition multiple times, using a sliding window to select the initial subset, and gathers all unique results. These results can be very similar and differ by only single positions. The algorithm therefore reduces the number of solutions by merging them based on a similarity threshold. The similarity is defined by the fraction of positions in solution A that are also present in solution B. As a final result, the algorithm detects common rigid subsets of positions.
- promod3.modelling.RigidBlocks(bb_list_one, bb_list_two[, window_length=12, max_iterations=20, distance_thresh=3.0, cluster_thresh=0.9])¶
Performs the RigidBlocks algorithm on the given input
- Parameters:
  - bb_list_one (promod3.loop.BackboneList) – First piece of structural information from which CA positions will be extracted
  - bb_list_two (promod3.loop.BackboneList) – Second piece of structural information from which CA positions will be extracted
  - window_length (int) – Length of the sliding window used to generate initial subsets
  - max_iterations (int) – Maximal number of iterations for every single iterative superposition
  - distance_thresh (float) – Maximal distance two CA positions can have to be considered part of the same rigid block and to be selected for the common subset in the next iteration of the iterative superposition
  - cluster_thresh (float) – Similarity threshold for the final merging of solutions
- Returns:
  A tuple with the first element being a list of list objects defining the indices of the common subsets (rigid blocks) relative to the input promod3.loop.BackboneList objects, and the second element being a list of ost.geom.Mat4 defining the transformations to superpose the corresponding positions in bb_list_one onto bb_list_two
- promod3.modelling.RigidBlocks(aln[, seq_one_idx=0, seq_two_idx=1, window_length=12, max_iterations=20, distance_thresh=3.0, cluster_thresh=0.9])¶
Performs the RigidBlocks algorithm on the given input
- Parameters:
  - aln (ost.seq.AlignmentHandle) – An alignment with attached ost.mol.EntityView objects from which the positions are extracted
  - seq_one_idx (int) – Index of the first sequence, from which the CA positions will be extracted
  - seq_two_idx (int) – Index of the second sequence, from which the CA positions will be extracted
  - window_length (int) – Length of the sliding window used to generate initial subsets
  - max_iterations (int) – Maximal number of iterations for every single iterative superposition
  - distance_thresh (float) – Maximal distance two CA positions can have to be considered part of the same rigid block and to be selected for the common subset in the next iteration of the iterative superposition
  - cluster_thresh (float) – Similarity threshold for the final merging of solutions
- Returns:
  A tuple with the first element being a list of list objects defining the column indices of the common subsets (rigid blocks) relative to the input ost.seq.AlignmentHandle, and the second element being a list of ost.geom.Mat4 defining the transformations to superpose the corresponding positions from the first sequence onto the second sequence.
- promod3.modelling.RigidBlocks(pos_one, pos_two[, window_length=12, max_iterations=20, distance_thresh=3.0, cluster_thresh=0.9])¶
Performs the RigidBlocks algorithm on the given input
- Parameters:
  - pos_one (ost.geom.Vec3List) – First piece of position information
  - pos_two (ost.geom.Vec3List) – Second piece of position information
  - window_length (int) – Length of the sliding window used to generate initial subsets
  - max_iterations (int) – Maximal number of iterations for every single iterative superposition
  - distance_thresh (float) – Maximal distance two positions can have to be considered part of the same rigid block and to be selected for the common subset in the next iteration of the iterative superposition
  - cluster_thresh (float) – Similarity threshold for the final merging of solutions
- Returns:
  A tuple with the first element being a list of list objects defining the indices of the common subsets (rigid blocks) relative to the input ost.geom.Vec3List objects, and the second element being a list of ost.geom.Mat4 defining the transformations to superpose the corresponding positions in pos_one onto pos_two
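The following is a minimal sketch of the position-based overload. The file names are hypothetical and a one-to-one residue mapping with CA atoms present in every residue is assumed:

from ost import io, geom
from promod3 import modelling

# two conformations of the same protein (hypothetical files)
prot_one = io.LoadPDB("data/conformation_one.pdb")
prot_two = io.LoadPDB("data/conformation_two.pdb")

# extract CA positions, assuming a one-to-one residue mapping
pos_one = geom.Vec3List()
pos_two = geom.Vec3List()
for r_one, r_two in zip(prot_one.residues, prot_two.residues):
    pos_one.append(r_one.FindAtom("CA").GetPos())
    pos_two.append(r_two.FindAtom("CA").GetPos())

# indices of the rigid blocks and transformations to superpose
# the according positions in pos_one onto pos_two
blocks, transformations = modelling.RigidBlocks(pos_one, pos_two)
for block in blocks:
    print("rigid block covering %i positions" % len(block))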
De Novo Modelling¶
ProMod3 provides algorithms for sampling and fragment detection. Here we provide an object that facilitates fragment detection and caching, as well as a convenience function that combines these functionalities into an example pipeline.
- class promod3.modelling.FraggerHandle(sequence, profile=None, psipred_pred=None, fragment_length=9, fragments_per_position=100, rmsd_thresh=0.0, structure_db=None, torsion_sampler_coil=None, torsion_sampler_helix=None, torsion_sampler_extended=None)¶
Handler for Fragger objects linked to a specific chain.
Tries to get the most accurate fragments given your input. If you provide only a SEQRES, the returned fragments are searched using sequence similarity as the only target value. You can massively increase the accuracy of the found fragments by providing a secondary structure prediction and / or a sequence profile.
The following features influence the fragment search, depending on your input:
sequence:
Sequence Similarity with BLOSUM62
sequence, psipred_pred:
Sequence Similarity with BLOSUM62
Secondary Structure Agreement
Secondary Structure Dependent Torsion Probabilities
sequence, profile:
Sequence Profile Score
Structure Profile Score
sequence, psipred_pred, profile:
Secondary Structure Agreement
Secondary Structure Dependent Torsion Probabilities
Sequence Profile Score
Structure Profile Score
The FraggerHandle internally uses a promod3.loop.FraggerMap for caching. You can therefore request fragments for a certain position several times and the search is performed only once. This also allows saving the FraggerHandle to disk. When loading the FraggerHandle again, you need to provide all parameters again. These parameters must be exactly the same as the ones you used when initially constructing the FraggerHandle, especially the structure database; weird things happen otherwise.
- Parameters:
  - sequence (str / ost.seq.SequenceHandle) – SEQRES for this chain
  - profile (ost.seq.ProfileHandle) – Sequence profile for this chain.
  - psipred_pred (promod3.loop.PsipredPrediction) – Psipred prediction for this chain.
  - fragment_length (int) – Length (num. residues) of fragments to be extracted.
  - fragments_per_position (int) – Number of fragments to be extracted at each position.
  - rmsd_thresh (float) – To guarantee structural diversity, no pair of fragments at a given position will have an RMSD below rmsd_thresh.
  - structure_db (promod3.loop.StructureDB) – Source of structural data
  - torsion_sampler_coil (promod3.loop.TorsionSampler) – Torsion sampler for coil residues.
  - torsion_sampler_helix (promod3.loop.TorsionSampler) – Torsion sampler for helical residues.
  - torsion_sampler_extended (promod3.loop.TorsionSampler) – Torsion sampler for extended residues.
- Get(frag_pos)¶
Get fragger for the sequence at indices frag_pos..frag_pos+fragment_length-1.
- Parameters:
  frag_pos (int) – Start index (note that sequence indexing starts at 0)
- Returns:
  A Fragger object.
- Raises:
  ValueError if the index is out of bounds.
- GetList(pos_start=0, pos_end=-1)¶
Get a list of fraggers covering sequence indices pos_start..pos_end.
This will return an empty list if the range is smaller than fragment_length.
- LoadCached(filename)¶
Load fragger objects stored with SaveCached(). Note that here we require that the same structure db is set as was used when filename was saved.
- SaveCached(filename)¶
Save cached fraggers.
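A minimal sketch of the caching behaviour, using a hypothetical SEQRES and relying on the internally loaded defaults for the structure database and torsion samplers:

from promod3 import modelling

# hypothetical SEQRES; with sequence only, fragments are searched
# by sequence similarity alone
seqres = "MKVLITGAGSGIGLAIAKRFAEEGA"
handle = modelling.FraggerHandle(seqres, fragment_length=9,
                                 fragments_per_position=50)

# Fragger covering residues 0..8; a repeated call for the same
# position hits the internal FraggerMap cache
fragger = handle.Get(0)

# fraggers covering the full sequence
fragger_list = handle.GetList()

# cache fraggers to disk; on reload, the handle must be constructed
# with exactly the same parameters, especially the structure db
handle.SaveCached("fraggers.dat")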
- promod3.modelling.GenerateDeNovoTrajectories(sequence, num_trajectories=200, avg_sampling_per_position=600, profile=None, psipred_prediction=None, fragment_handler=None, scorer=None, scorer_env=None, scoring_weights=None)¶
Example de novo modelling pipeline based on Fragment sampling and backbone scoring. Take this as a starting point for more advanced de novo procedures.
- Parameters:
  - sequence (str) – The sequence you want to sample
  - num_trajectories (int) – The number of sampling trajectories you want to generate
  - avg_sampling_per_position (int) – Number of Monte Carlo sampling steps per position; the total number is len(sequence) * avg_sampling_per_position
  - profile (ost.seq.ProfileHandle) – The sequence profile for sequence. This increases the fragment search performance.
  - psipred_prediction (promod3.loop.PsipredPrediction) – The psipred prediction for sequence. This increases the fragment search performance.
  - fragment_handler (promod3.modelling.FraggerHandle) – You can provide an already initialized fragment handler. If you pass this parameter, profile and psipred_prediction are ignored and do not influence the fragment search; the ones you initialized fragment_handler with are used instead.
  - scorer (promod3.scoring.BackboneOverallScorer) – Scorer doing the backbone scoring. If not provided, a default one is loaded with default objects with the following keys: clash, reduced, cb_packing, hbond, cbeta, torsion and pairwise
  - scorer_env (promod3.scoring.BackboneScoreEnv) – The scoring environment that relates to scorer. This environment will be changed!
  - scoring_weights (dict) – Linear weights for the different scores. If not provided, the output of ScoringWeights.GetWeights() is used. Please note that the weights must be consistent with the keys of the scores in scorer
- Returns:
  A promod3.loop.LoopCandidates object containing num_trajectories elements for further processing
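A minimal usage sketch with a hypothetical target sequence; fragment handler, scorer, environment and weights are set up internally with the defaults described above:

from promod3 import modelling

# hypothetical target sequence
seq = "MKVLITGAGSGIGLAIAKRFAEEGA"

# total sampling steps amount to len(seq) * avg_sampling_per_position
candidates = modelling.GenerateDeNovoTrajectories(seq, num_trajectories=20)

# candidates is a promod3.loop.LoopCandidates with 20 elements,
# ready for further processing (e.g. scoring and selection)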
Motif Finder¶
Distinct spatial arrangements of atoms or functional groups are key for protein function. For their detection, ProMod3 implements the MotifFinder algorithm which is based on geometric hashing as described by Nussinov and Wolfson [nussinov1991]. The algorithm consists of a learning stage, a detection stage and a refinement stage.
Learning Stage: A motif (query) is represented by a set of coordinates. Triplets (p1, p2, p3) of coordinates are selected that define triangles. For each triangle one can define an orthogonal vector basis (in our case v1 = norm(p2-p1), v3 = norm(cross(v1, p3-p1)), v2 = norm(cross(v1, v3))). For each coordinate not in [p1, p2, p3], we add the identity of the query/triangle as a value to a hash map. The corresponding key consists of discretized values describing the edge lengths of the triangle, as well as the coordinate transformed into the triangle specific orthogonal vector basis. That's 6 numbers in total.
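As an illustration, the triangle specific basis and the 6 key numbers can be sketched with numpy; this mirrors the formulas above and is not the internal implementation:

import numpy as np

def triangle_basis(p1, p2, p3):
    # orthogonal vector basis as defined above
    v1 = p2 - p1
    v1 /= np.linalg.norm(v1)
    v3 = np.cross(v1, p3 - p1)
    v3 /= np.linalg.norm(v3)
    v2 = np.cross(v1, v3)
    v2 /= np.linalg.norm(v2)
    return v1, v2, v3

def key_numbers(p, p1, basis, edge_lengths):
    # the coordinate expressed in the triangle specific basis plus the
    # three edge lengths: the 6 numbers that are discretized into a key
    v1, v2, v3 = basis
    d = p - p1
    return (edge_lengths[0], edge_lengths[1], edge_lengths[2],
            np.dot(d, v1), np.dot(d, v2), np.dot(d, v3))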
Detection Stage: The goal is to identify one or several subsets of target coordinates that resemble an input query. We first set up an accumulator containing a counter for each triangle observed in the input query. We then iterate over each possible triangle with vertices p1, p2 and p3 in the target coordinates. At the beginning of each iteration, all counters in the accumulator are set to zero. Again, we build a vector basis given that triangle and transform all coordinates not in [p1, p2, p3] into that vector space. For each transformed coordinate we obtain a key for the query hash map. If there are one or several values at that location in the hash map, we increment the corresponding locations in the accumulator. Once all coordinates are processed, we search for high counts in the accumulator. Given N query coordinates, we keep a solution for further refinement if count/(N-3) >= hash_tresh. This is repeated until all triangles in the target are processed. One key problem with this approach is the discretization of the floating point numbers that give rise to the hash map keys. Two extremely close values might end up in different bins just because they are close to the bin boundaries. For each of the 6 relevant numbers we therefore estimate the actual bin as well as the closest neighbouring bin. Processing all possible combinations (two candidate bins for each of the 6 numbers) results in 2^6 = 64 hash map lookups instead of only one.
Refinement Stage: Every potential solution identified in the detection stage is further refined based on the distance_thresh and refine_thresh parameters. A potential solution found in the detection stage is a pair of triangles, one in the query and one in the target, for which we find many matching coordinates in their respective vector spaces. We start with a coordinate mapping based on the triangle vertices from the query and the target (3 pairs). This coordinate mapping is iteratively updated by estimating the minimum RMSD superposition of the mapped query coordinates onto the target, applying that superposition to the query, finding the closest target coordinate for each coordinate in the query, and redoing the mapping by including all pairs with minimum distance < distance_thresh. Iteration stops when the mapping no longer changes. The solution is returned to the user if the final fraction of mapped query coordinates is larger than or equal to refine_thresh. The larger the mapping, the more accurate the superposition. As we start with only the three triangle vertices, distance_thresh is doubled for the initial iteration.
# Example script that loads protein structures that contain ATP and
# generates motif queries describing their binding pockets.
# In a second step, those pockets are matched against a protein
# structure that only contains an ATP analog. The ATP from every
# match is transformed and stored to disk for further processing.
from ost import io, geom, mol
from promod3 import modelling

files = ['data/1E2Q.pdb', 'data/1KO5.pdb', 'data/2IYW.pdb']

atp_list = list()
query_list = list()

for f in files:
    prot = io.LoadPDB(f)
    peptide_sel = prot.Select("peptide=true")
    atp_sel = prot.Select("rname=ATP")

    # generate a single query for each ATP pocket
    for atp_idx, atp_r in enumerate(atp_sel.residues):
        pocket_view = peptide_sel.CreateEmptyView()
        for atp_at in atp_r.atoms:
            close_at = peptide_sel.FindWithin(atp_at.GetPos(), 4.5)
            for at in close_at:
                r = at.handle.GetResidue()
                add_flag = mol.INCLUDE_ATOMS | mol.CHECK_DUPLICATES
                pocket_view.AddResidue(r, add_flag)
        ca_positions = geom.Vec3List()
        for res in pocket_view.residues:
            ca_positions.append(res.FindAtom("CA").GetPos())
        i = "%s_%i" % (f, atp_idx)
        query = modelling.MotifQuery(ca_positions, i, 4.0, 9.0, 1.0)
        query_list.append(query)

        # create an entity from atp for later use
        atp_view = prot.CreateEmptyView()
        atp_view.AddResidue(atp_r, mol.INCLUDE_ATOMS)
        atp_list.append(mol.CreateEntityFromView(atp_view, True))

# That's it, let's combine the single queries
full_query = modelling.MotifQuery(query_list)

prot = io.LoadPDB("data/1AKE.pdb")
peptide_sel = prot.Select("peptide=true")
ca_positions = geom.Vec3List()
for r in peptide_sel.residues:
    ca_positions.append(r.FindAtom("CA").GetPos())

# search all matches, fetch the corresponding atps,
# transform them onto the 1ake binding site and dump them to disk
matches = modelling.FindMotifs(full_query, ca_positions)
for m_idx, m in enumerate(matches):
    atp = atp_list[m.query_idx].Copy()
    atp.EditXCS().ApplyTransform(m.mat)
    io.SavePDB(atp, "m_%i.pdb" % m_idx)
- class promod3.modelling.MotifQuery(positions, identifier, min_triangle_edge_length, max_triangle_edge_length, bin_size)¶
- class promod3.modelling.MotifQuery(positions, identifier, min_triangle_edge_length, max_triangle_edge_length, bin_size, flags)
- class promod3.modelling.MotifQuery(query_list)
A single query or a container of queries. The constructor performs the learning stage of a single query or combines several queries, so they can be searched at once.
- Parameters:
  - positions (ost.geom.Vec3List) – Coordinates of the query
  - identifier (str) – Descriptor of the query
  - min_triangle_edge_length (float) – To avoid the full O(n^3) hell, triangles with any edge length below min_triangle_edge_length are skipped
  - max_triangle_edge_length (float) – Same as min_triangle_edge_length but upper bound
  - bin_size (float) – Bin size in A, relevant to generate hash map keys
  - flags (list of int) – Flag in range [0,63] for every coordinate in positions. They're also added to the hash map keys (default: 0). This means that additionally to having a matching relative position, the coordinates must also have a matching flag in the detection/refinement stage. If not provided (in the query and in the search), only the coordinates matter.
  - query_list (list of MotifQuery) – E pluribus unum
- GetPositions(query_idx)¶
Returns coordinates of the specified query
- Parameters:
  query_idx (int) – Query from which you want the positions
- GetIdentifiers()¶
Returns a list of all query identifiers.
- GetN()¶
Returns the number of queries
- class promod3.modelling.MotifMatch¶
Object that holds information about a match found in FindMotifs()
- query_idx¶
Index of matching query
- mat¶
Transformation matrix to superpose matching query onto target
- alignment¶
List of tuples which define matching pairs of query/target coordinates
- promod3.modelling.FindMotifs(query, target_positions, hash_tresh=0.4, distance_thresh=1.0, refine_thresh=0.7, flags=list(), swap_thresh=False)¶
Performs the detection and refinement stages of the geometric hashing algorithm.
- Parameters:
  - query (MotifQuery) – Query to be searched
  - target_positions (ost.geom.Vec3List) – Coordinates of the target
  - hash_tresh (float) – Parameter relevant for the detection stage
  - distance_thresh (float) – Parameter relevant for the refinement stage
  - refine_thresh (float) – Parameter relevant for the refinement stage
  - flags (list of int) – Equivalent to flags in the MotifQuery constructor. If you didn't provide anything there, this can be ignored; only the actual coordinates matter in this case.
  - swap_thresh (bool) – By default, hash_tresh and refine_thresh refer to the fraction of covered positions in query. When setting this to True, they refer to the fraction of covered positions in target_positions.
- Returns:
  All found matches
- Return type:
  list of MotifMatch
AFDB Modelling¶
Template based modelling using AFDB as the template library comes with two challenges for which ProMod3 provides solutions:
- efficient structure storage for the whole AFDB: FSStructureServer
- fast sequence searches with limited sensitivity: PentaMatch
The creation of these two objects requires extensive preprocessing. The required scripts and documentation are available in <GIT_ROOT>/extras/data_generation/afdb_modelling.
Basic modelling functionality is available in the following two functions:
- promod3.modelling.AFDBTPLSearch(fs_server, pentamatch, trg_seq, pentamatch_n=100, seqid_thresh=70, tpl_n=5)¶
Searches tpl_n templates in fs_server/pentamatch
Step 1: Identify the pentamatch_n sequences in pentamatch with the largest number of matching pentamers with respect to trg_seq.
Step 2: Generate pairwise alignments with ost.seq.alg.LocalAlign() and only retain the ones with seqid >= seqid_thresh.
Step 3: Extract the respective templates from fs_server and score them by the sum of the plDDT values of aligned residues divided by the trg_seq length.
Step 4: Return the top tpl_n (or fewer).
- Parameters:
  - fs_server (FSStructureServer) – Structure database - The AFDB
  - pentamatch (PentaMatch) – Pentamatch object specific for fs_server
  - trg_seq (ost.seq.SequenceHandle / str) – Target sequence
  - pentamatch_n (int) – Number of sequences that are initially searched in pentamatch
  - seqid_thresh (int) – Sequence identity threshold [0-100] for an alignment to be considered further
  - tpl_n (int) – Number of templates that are finally returned based on the described scoring
- Returns:
  list of pairs with the first element being the tpl score and the second element being an ost.seq.AlignmentHandle with the first sequence being trg_seq and the second sequence being the hit found in fs_server with structure attached. If fs_server has been generated with the default procedure described in the docs, additional info is available in the name of the attached structure, accessible with aln.GetSequence(1).GetAttachedView().GetName(). It is structured as "<UniprotAC> <Fragment> <AFDB version> <Idx>", where Idx refers to the raw idx of the template in fs_server.
- promod3.modelling.AFDBModel(fs_server, pentamatch, trg_seq, transfer_bfactors=False)¶
Build model with AFDB as template library
- Parameters:
  - fs_server (FSStructureServer) – Structure database - The AFDB
  - pentamatch (PentaMatch) – Pentamatch object specific for fs_server
  - trg_seq (ost.seq.SequenceHandle / str) – Target sequence
  - transfer_bfactors (bool) – Simple heuristic to transfer bfactors (plDDT) to the model. For an aligned residue, the bfactor (i.e. plDDT) is simply transferred. Gaps are treated with a heuristic that operates on the full stretch of remodelled amino acids and not solely on the gap indicated by the alignment. We first derive a minimum plDDT which is 0.5*min(n_stem_plddt, c_stem_plddt). The plDDT values of the processed amino acids then decrease linearly from the stem towards that minimum with a slope of 0.25 (i.e. they reach the minimum value when they're 4 residues away).
- Returns:
  A tuple with 4 elements: 1) the model, 2) the model score based on plDDT, 3) the pairwise alignment (first seq: trg_seq, second seq: tpl seq), 4) the template name (formatted as "<uniprot AC> <AFDB_fragment> <AFDB_version> <chain name>"). If no appropriate template can be found, all 4 elements are None.
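A minimal sketch, assuming the preprocessed databases already exist in the hypothetical directories below:

from ost import io
from promod3 import modelling

# load preprocessed AFDB structure storage and pentamer search index
fs_server = modelling.FSStructureServer("afdb_db_dir")
pentamatch = modelling.PentaMatch("pentamatch_db_dir")

# hypothetical target sequence
trg_seq = "MKVLITGAGSGIGLAIAKRFAEEGA"

model, score, aln, tpl_name = modelling.AFDBModel(fs_server, pentamatch,
                                                  trg_seq)
if model is not None:
    io.SavePDB(model, "model.pdb")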
- class promod3.modelling.FSStructureServer(db_dir)¶
FSStructureServer - A filesystem based AFDB structure server
Stores OMF data entries in huge binary files and extracts the respective binary data blobs with an indexing mechanism. Efficient reading of these huge files is delegated to the OS using the Python mmap module.
Data creation happens with the static FromDataChunks() function. The required preprocessing must be done with external scripts.
- Parameters:
  db_dir (str) – Directory containing the required files: 1) pos.dat (see pos), 2) length.dat (see length), 3) chunk.dat (see chunk), 4) indexer.dat (see indexer), 5) one or several files with pattern fs_data_[x].dat (see data)
- property pos¶
Start positions of data in internal linear memory layout
- Type:
  np.ndarray with dtype np.int64
- property length¶
Lengths of data in internal linear memory layout
- Type:
  np.ndarray with dtype np.int32
- property chunk¶
Chunk in which an entry is stored
- Type:
  np.ndarray with dtype np.int16
- property indexer¶
Internal data structure - Relates entries with data indices
- Type:
  np.ndarray with dtype np.int32
- property search_keys¶
Internal data structure - Relates entries with data indices
- GetIdx(uniprot_ac, fragment='F1', version='v4')¶
Get internal idx of stored data
Can be used for data retrieval with GetOMFByIdx()
- Parameters:
  - uniprot_ac (str) – Uniprot AC of the entry
  - fragment (str) – Fragment identifier
  - version (str) – AFDB version
- Returns:
  Internal idx of that entry
- Raises:
  RuntimeError if no such entry exists
- GetOMFByIdx(idx)¶
Get stored OMF data structure
- Parameters:
  idx (int) – Internal index, which can be derived from GetIdx()
- Returns:
  OMF data structure of type ost.io.OMF
- Raises:
  RuntimeError if idx is out of range
- GetOMFByPLC(pos, length, chunk)¶
Get stored OMF data structure
Get data by explicitly specifying PLC (pos, length, chunk). For expert use only; no range checks are performed. Instead of providing a uniprot AC or an index, this function directly takes the internal pos, length and chunk parameters that are stored for that particular index. Use case: avoid loading the respective data files and only open the memory mapped files.
- GetOMF(uniprot_ac, fragment='F1', version='v4')¶
Get stored OMF data structure
- Parameters:
  - uniprot_ac (str) – Uniprot AC of the entry
  - fragment (str) – Fragment identifier
  - version (str) – AFDB version
- Returns:
  OMF data structure of type ost.io.OMF
- Raises:
  RuntimeError if no such entry exists
- static FromDataChunks(chunk_dir, db_dir, chunk_bytes=None)¶
Static method to create new database from preprocessed data
Data preprocessing consists of creating several data chunks that are pickled to disk. In detail: each chunk is a pickled file containing a list, where each entry is a tuple with 4 elements: 1) uniprot_ac (str), 2) fragment (str), 3) version (str), 4) structure data (an ost.io.OMF object which has been written to a bytes object and compressed with gzip). The data itself is again stored in chunks of binary data which are indexed.
- Parameters:
  - chunk_dir (str) – Path to directory containing the described data chunks
  - db_dir (str) – Output directory - Creates all files that are needed for FSStructureServer
  - chunk_bytes (int) – Size in bytes of binary data chunks in the final database - default None: everything ends up in one single chunk
- Returns:
  FSStructureServer with all data from chunk_dir
- class promod3.modelling.PentaMatch(db_dir)¶
Pentamer matching for fast sequence searches
PentaMatch targets fast sequence searches with low sensitivity, specifically searching the full AFDB. It stores all unique pentamers for each search sequence. Given a query sequence, it computes the number of matching pentamers with respect to each search sequence and returns the top hits.
- Parameters:
  db_dir (str) – Directory containing all required files (indexer.dat, pos.dat, length.dat, meta.dat). New PentaMatch objects can be derived using the static FromSeqList() creator.
- Raises:
  RuntimeError if any required file is missing in db_dir
- property indexer¶
Entry indices for pentamers
Entry data for one pentamer can be extracted with the respective values in pos and length
- Type:
  memory mapped np.ndarray of dtype np.int32
- property pos¶
Start position for each pentamer in indexer
- Type:
  np.ndarray of dtype np.int64 with 3.2E6 entries
- property length¶
Length for each pentamer in indexer
- Type:
  np.ndarray of dtype np.int32 with 3.2E6 entries
- property N¶
Number of entries in underlying FSStructureServer
- Type:
  int
- TopN(N, sequence, return_counts=False, unique_pentamers=True)¶
Find top-N matches given sequence
Identifies unique pentamers in sequence and counts the number of occurrences in each entry. Returns the top-N entries with respect to counts.
- Parameters:
  - N (int) – Number of results to return
  - sequence (str) – Sequence to search
  - return_counts (bool) – Additionally return the underlying pentamer counts
  - unique_pentamers (bool) – Set to True if indexer contains only unique pentamers for each entry. This way we can use faster methods to accumulate counts. In detail: accumulator[indices] += 1 is much faster than np.add.at(accumulator, indices, 1), but the latter is required if there are duplicates.
- Returns:
  list of int with length N specifying entry indices. If return_counts is True, the list contains tuple objects with two elements: 1) count, 2) index.
- Raises:
  RuntimeError if N is invalid or sequence is shorter than 5 characters
- static FromSeqList(fasta_file, db_dir, entries_from_seqnames=False)¶
Creates a PentaMatch object from a Fasta file
- Parameters:
  - fasta_file (str) – Path to Fasta file with sequences
  - db_dir (str) – Files required for PentaMatch will be dumped here; will be created if non-existent.
  - entries_from_seqnames (bool) – If set to False, indices returned by TopN() refer to the position in fasta_file. If set to True, integer indices are parsed from the sequence name.
- Returns:
  PentaMatch
- Raises:
  ost.Error if entries_from_seqnames is True but a sequence name cannot be cast to int.
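A minimal sketch combining the static creator and a search; the file and directory names are hypothetical:

from promod3 import modelling

# one-time creation of the search index from a Fasta file
pentamatch = modelling.PentaMatch.FromSeqList("data/sequences.fasta",
                                              "pentamatch_db_dir")

# later: load the index and retrieve the 10 entries that share the
# largest number of pentamers with the query sequence
pentamatch = modelling.PentaMatch("pentamatch_db_dir")
top_hits = pentamatch.TopN(10, "MKVLITGAGSGIGLAIAKRFAEEGA")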