hhblits - Search related sequences in databases
Introduction
HHblits is a sequence search tool like BLAST but able to find more distant homologs. This is achieved by aligning hidden Markov models (HMMs) in the search process, as opposed to the sequence-sequence comparisons in BLAST. HHblits searches an HMM database, which is usually provided, with an HMM representing your query sequence. The query HMM needs to be calculated before the actual search. The software suite needed for HHblits can be found on GitHub. The deprecated HHblits 2.x suite is also still available.
On HHblits Versions
The binding for HHblits 3 has internally been forked from the HHblits 2 binding. The binding for HHblits 2 is considered deprecated and no longer receives bugfixes. Accordingly, this documentation refers to the HHblits 3 binding. The different bindings can be imported explicitly:
from ost.bindings import hhblits2
from ost.bindings import hhblits3
Alternatively you can let OpenStructure figure out the HHblits version you’re using and import the appropriate binding for you under the base name hhblits. This assumes the HHblits binary (hhblits) to be in your path and raises an error otherwise.
from ost.bindings import hhblits
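Since the version-agnostic import relies on the hhblits binary being in your path, it can be useful to check for it up front. The following is a minimal sketch using only the Python standard library; the helper name hhblits_in_path is hypothetical and not part of OpenStructure.

```python
import shutil


def hhblits_in_path(binary="hhblits"):
    """Return True if an executable called `binary` is found on PATH."""
    return shutil.which(binary) is not None


if hhblits_in_path():
    print("hhblits found in PATH")
else:
    # Fall back to importing hhblits2 or hhblits3 explicitly.
    print("hhblits not found in PATH")
```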
Examples
A typical search: Get an instance of the binding, build the search profile out of the query sequence, run the search and iterate results.
from ost import seq
from ost.bindings import hhblits3
# Create a SequenceHandle, alternatively you can load any sequence in
# FASTA format using ost.io.LoadSequence(<PATH_TO_SEQUENCE>)
query_seq = seq.CreateSequence('Query',
                               'TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN')
# set up the search environment
# let's assume a default installation with hhblits binary at
# <PATH_TO_HHBLITS_INSTALL>/bin/hhblits
hh = hhblits3.HHblits(query_seq, '<PATH_TO_HHBLITS_INSTALL>')
# now create a search profile for the query sequence against uniclust30
# which you can get with instructions in the hh-suite user guide (github)
# <PATH_TO_DB>/uniclust30_2018_08 is just the prefix common to
# all db files, so `ls <PATH_TO_DB>/uniclust30_2018_08*` would list all
# of them
a3m_file = hh.BuildQueryMSA(nrdb='<PATH_TO_DB>/uniclust30_2018_08')
# let's load the data in the a3m_file and display the generated
# multiple sequence alignment; note that ParseA3M is not a class method
# but a module function
with open(a3m_file) as a3m_fh:
    a3m_data = hhblits3.ParseA3M(a3m_fh)
print(a3m_data['msa'])
# search time! we just search against uniclust again, but any HHblits db
# works here, e.g. one built from all the sequences in PDB
hit_file = hh.Search(a3m_file, '<PATH_TO_DB>/uniclust30_2018_08')
# let's have a look at the results
with open(hit_file) as hit_fh:
    header, hits = hhblits3.ParseHHblitsOutput(hit_fh)
for hit in hits:
    print(hit.aln)
# cleanup
hh.Cleanup()
Binding API
HHblits wrapper classes and functions.
class HHblits(query, hhsuite_root, hhblits_bin=None, working_dir=None)
Initialise a new HHblits “search” for the given query. Query may either be a SequenceHandle or a string. In the former case, the query is the actual query sequence; in the latter case, the query is the name of the file containing the query.
Parameters:
- query (SequenceHandle or str) – Query sequence as file or sequence.
- hhsuite_root (str) – Path to the top-level directory of your hhsuite installation.
- hhblits_bin (str) – Name of the hhblits binary. Will only be used if hhsuite_root/bin/hhblits does not exist.
- working_dir (str) – Directory for temporary files. Will be created if not present but not automatically deleted.
HHblits.A3MToCS(a3m_file, cs_file=None, options={})
Converts the A3M alignment file to a column state sequence file. If cs_file is not given, the output file will be set to <a3m_file-basename>.seq219.
If the file was already produced, the existing file path is returned without recomputing it.
Parameters:
- a3m_file (str) – Path to input MSA as produced by BuildQueryMSA()
- cs_file (str) – Output file name (may be omitted)
- options (dict) – Dictionary of options to cstranslate, one “-” is added in front of every key. Boolean True values add flag without value.
Returns: Path to the column state sequence file
Return type: str
A3MToProfile
(a3m_file, hhm_file=None)¶ Converts the A3M alignment file to a hhm profile. If hhm_file is not given, the output file will be set to <
a3m_file
-basename>.hhm.The produced HHM file can be parsed by
ParseHHM()
.If the file was already produced, the existing file path is returned without recomputing it.
Parameters: - a3m_file (
str
) – Path to input MSA as produced byBuildQueryMSA()
- hhm_file (
str
) – Desired output file name
Returns: Path to the profile file
Return type: str
- a3m_file (
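The "derive a default output name, reuse the file if it already exists" behaviour described for A3MToCS and A3MToProfile can be sketched as follows. This is an illustration, not the binding's implementation: the helper names are hypothetical, and it assumes that <a3m_file-basename> means the A3M path with its extension removed.

```python
import os


def default_output_path(a3m_file, extension):
    """Swap the extension of a3m_file, e.g. query.a3m -> query.hhm."""
    base, _ = os.path.splitext(a3m_file)
    return base + extension


def convert_once(a3m_file, extension, convert):
    """Call convert(a3m_file, out_file) only if out_file does not exist yet.

    `convert` is any callable that writes out_file; the path is returned
    in both cases.
    """
    out_file = default_output_path(a3m_file, extension)
    if not os.path.exists(out_file):
        convert(a3m_file, out_file)
    return out_file


print(default_output_path("/tmp/query.a3m", ".hhm"))
print(default_output_path("/tmp/query.a3m", ".seq219"))
```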
HHblits.AssignSSToA3M(a3m_file)
HHblits does not assign predicted secondary structure by default. You can optionally assign it with the addss.pl script provided by the HH-suite. However, this requires your HH-suite installation to be configured with the paths to PSIPRED etc. We refer to the HH-suite user guide for further instructions.
Parameters:
- a3m_file (str) – Path to file you want to assign secondary structure to
HHblits.BuildQueryMSA(nrdb, options={}, a3m_file=None, assign_ss=True)
Builds the MSA for the query sequence.
The produced A3M file can be parsed by ParseA3M(). If the file was already produced, hhblits is not called again and the existing file path is returned (the assign_ss flag is ignored in that case).
Parameters:
- nrdb (str) – Database to align against; has to be an hhblits database
- options (dict) – Dictionary of options to hhblits, one “-” is added in front of every key. Boolean True values add flag without value. Merged with default options {‘cpu’: 1, ‘n’: 1, ‘e’: 0.001}, where ‘n’ defines the number of iterations and ‘e’ the E-value cutoff for inclusion of sequences in result alignment.
- a3m_file (str) – Path of the A3M file to be used (optional)
- assign_ss (bool) – HHblits does not assign predicted secondary structure by default. You can optionally assign it with the addss.pl script provided by the HH-suite. However, this requires your HH-suite installation to be configured with the paths to PSIPRED etc. We refer to the HH-suite user guide for further instructions. Assignment is done by calling HHblits.AssignSSToA3M()
Returns: The path to the A3M file containing the MSA
Return type: str
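To make the options convention concrete, here is a minimal sketch of how such a dictionary could be merged with the defaults and turned into command-line arguments. It is an illustration rather than the binding's actual code: options_to_args is a hypothetical helper, and skipping False/None values is an assumption, since the text above only specifies the behaviour for True.

```python
DEFAULTS = {'cpu': 1, 'n': 1, 'e': 0.001}


def options_to_args(options, defaults=DEFAULTS):
    """Merge user options over the defaults and build an argument list."""
    merged = dict(defaults)
    merged.update(options)
    args = []
    for key, value in merged.items():
        if value is True:
            # Boolean True: flag without value
            args.append('-' + key)
        elif value is False or value is None:
            # Assumption: such entries are simply left out
            continue
        else:
            args.extend(['-' + key, str(value)])
    return args


# Three iterations instead of one, plus a value-less flag
print(options_to_args({'n': 3, 'all': True}))
# ['-cpu', '1', '-n', '3', '-e', '0.001', '-all']
```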
HHblits.Cleanup()
Delete temporary data.
Delete temporary data if no working dir was given. Controlled by needs_cleanup.
HHblits.CleanupFailed()
In case something went wrong, call to make sure everything is clean.
This will delete the working dir independently of needs_cleanup.
HHblits.Search(a3m_file, database, options={}, prefix='')
Searches for templates in the given database. Before running the search, the hhm file is copied. This makes it possible to launch several hhblits instances at once. Upon success, the filename of the result file is returned. This file may be parsed with ParseHHblitsOutput().
Parameters:
- a3m_file (str) – Path to input MSA as produced by BuildQueryMSA()
- database (str) – Search database, needs to be the common prefix of the database files
- options (dict) – Dictionary of options to hhblits, one “-” is added in front of every key. Boolean True values add flag without value. Merged with default options {‘cpu’: 1, ‘n’: 1}, where ‘n’ defines the number of iterations.
- prefix (str) – Prefix to the result file
Returns: The path to the result file
Return type: str
class HHblitsHit(hit_id, aln, score, ss_score, evalue, pvalue, prob)
A hit found by HHblits.
Attributes:
- hit_id (str) – String identifying the hit
- aln (AlignmentHandle) – Pairwise alignment containing the aligned part between the query and the target. First sequence is the query, the second sequence the target.
- score (float) – The alignment score
- ss_score (float) – The secondary structure score
- evalue (float) – The E-value of the alignment
- pvalue (float) – The P-value of the alignment
- prob (float) – The probability of the alignment (between 0 and 100)
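The attributes above lend themselves to simple post-filtering of a hit list. The sketch below uses a namedtuple as a stand-in for HHblitsHit so that it runs without OpenStructure; the helper confident_hits and its thresholds are arbitrary illustrations, not recommendations from the HH-suite.

```python
from collections import namedtuple

# Stand-in with the same attribute names as HHblitsHit
Hit = namedtuple('Hit', 'hit_id aln score ss_score evalue pvalue prob')


def confident_hits(hits, min_prob=90.0, max_evalue=1e-3):
    """Keep hits passing both thresholds, highest probability first."""
    selected = [h for h in hits
                if h.prob >= min_prob and h.evalue <= max_evalue]
    return sorted(selected, key=lambda h: h.prob, reverse=True)


# Made-up values, for illustration only
hits = [
    Hit('A', None, 50.0, 0.0, 1e-10, 1e-14, 95.0),
    Hit('B', None, 10.0, 0.0, 5.0, 1e-3, 20.0),
    Hit('C', None, 80.0, 0.0, 1e-20, 1e-24, 99.9),
]
print([h.hit_id for h in confident_hits(hits)])
# ['C', 'A']
```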
class HHblitsHeader
Stats from the beginning of search output.
Attributes:
- query (str) – The name of the query sequence
- match_columns (int) – Total of aligned Match columns
- n_eff (float) – Value of the -neff option
- searched_hmms (int) – Number of profiles searched
- date (datetime.datetime) – Execution date
- command (str) – Command used to run
ParseHHblitsOutput(output)
Parses the HHblits output as produced by HHblits.Search() and returns the header of the search results and a list of hits.
Parameters:
- output (iterable (e.g. an open file handle)) – Iterable containing the lines of the HHblits output file
Returns: a tuple of the header of the search results and the hits
Return type: (HHblitsHeader, list of HHblitsHit)
ParseA3M(a3m_file)
Parse secondary structure information and the multiple sequence alignment out of an A3M file as produced by HHblits.BuildQueryMSA().
Parameters:
- a3m_file (iterable (e.g. an open file handle)) – Iterable containing the lines of the A3M file
Returns: Dictionary containing “ss_pred” (list), “ss_conf” (list) and “msa” (AlignmentHandle). If not available, “ss_pred” and “ss_conf” entries are set to None.
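Because “ss_pred” and “ss_conf” may be None, code consuming the returned dictionary should check for that before using them. The following sketch works on a plain dictionary shaped like the documented return value. The helper ss_summary is hypothetical, and it assumes that “ss_pred” holds one single-character state per residue, which is not spelled out above.

```python
from collections import Counter


def ss_summary(a3m_data):
    """Count predicted secondary structure states, or return None."""
    ss_pred = a3m_data.get('ss_pred')
    if ss_pred is None:
        # No secondary structure was assigned to this A3M file
        return None
    return dict(Counter(ss_pred))


# Dictionaries mimicking the documented layout ('msa' omitted for brevity)
with_ss = {'ss_pred': list('CCHHHHEEC'), 'ss_conf': [9] * 9}
without_ss = {'ss_pred': None, 'ss_conf': None}

print(ss_summary(with_ss))
print(ss_summary(without_ss))
```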
ParseHHM(profile)
Parse secondary structure information and the MSA out of an HHM profile as produced by HHblits.A3MToProfile().
Parameters:
- profile (file) – Opened file handle holding the profile.
Returns: Dictionary containing “ss_pred” (list), “ss_conf” (list), “msa” (AlignmentHandle) and “consensus” (SequenceHandle). If not available, “ss_pred” and “ss_conf” entries are set to None.
ParseHeaderLine(line)
Fetch header content.
First, we seek the start of the identifier, that is, the position one past the first whitespace after the hit number. Since the identifier may itself contain whitespace, we cannot simply split the whole line.
Parameters:
- line (str) – Line from the output header.
Returns: Hit information and query/template offsets
Return type: (HHblitsHit, (int, int))
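The difficulty with identifiers containing whitespace can be illustrated with a small stand-alone parser. This sketch uses a different technique than the one described above: it anchors a regular expression at the end of the line, so the numeric columns are matched from the right and everything left of them is treated as hit number plus identifier. The column layout, the example line and the name parse_hit_line are assumptions for illustration; interpreting the offsets as the start positions of the query and template ranges is an assumption as well.

```python
import re

# Numeric columns, anchored at the end of the line (assumed layout):
# Prob E-value P-value Score SS Cols Query-range Template-range (length)
_HIT_TAIL = re.compile(
    r"\s+(?P<prob>\S+)\s+(?P<evalue>\S+)\s+(?P<pvalue>\S+)"
    r"\s+(?P<score>\S+)\s+(?P<ss_score>\S+)\s+(?P<cols>\d+)"
    r"\s+(?P<q_start>\d+)-(?P<q_end>\d+)"
    r"\s+(?P<t_start>\d+)-(?P<t_end>\d+)\s*\((?P<t_len>\d+)\)\s*$"
)


def parse_hit_line(line):
    """Split a summary line into hit number, identifier and scores."""
    match = _HIT_TAIL.search(line)
    if match is None:
        raise ValueError("unexpected hit line: %r" % line)
    # Everything before the numeric columns: "<number> <identifier>"
    head = line[:match.start()].strip()
    number, _, hit_id = head.partition(" ")
    return {
        "number": int(number),
        "hit_id": hit_id.strip(),
        "prob": float(match.group("prob")),
        "evalue": float(match.group("evalue")),
        "pvalue": float(match.group("pvalue")),
        "score": float(match.group("score")),
        "ss_score": float(match.group("ss_score")),
        "offsets": (int(match.group("q_start")),
                    int(match.group("t_start"))),
    }


# Made-up line with whitespace inside the identifier
example = ("  1 1abc_A Some protein (fragment)   99.9 1.2E-30 3.4E-35"
           "  150.2   0.0  100    1-100     5-104 (120)")
parsed = parse_hit_line(example)
print(parsed["hit_id"], parsed["prob"], parsed["offsets"])
```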