hhblits
- Search related sequences in databases¶
Introduction¶
HHblits is a sequence search tool like BLAST, able to find more distant homologs This is achieved by performing profile-profile searches. In BLAST, a query sequence is searched against a sequence database. That is a sequence-sequence search. HHblits works on a profile database, usually that one is provided, queried with a sequence profile. The latter one needs to be calculated before the actual search. In very simple words, HHblits is using per-sequence scoring functions to be more sensitive, in this particular case Hidden Markov models. The software suite needed for HHblits can be found here.
Examples¶
A typical search: Get an instance of the bindings, build the search profile out
of the query sequence, run the search and iterate results. Since
HHblits
works with a SequenceHandle
or a FastA
file, both ways are shown.
First query by sequence:
from ost.bindings import hhblits
# get a SequenceHandle
query_seq = seq.CreateSequence('Query',
'MRIILLGAPGAGKGTQAQFIMEKYGIPQISTGDMLRAAVKSGS'+
'ELGKQAKDIMDAGKLVTDELVIALVKERIAQEDCRNGFLLDGF'+
'PRTIPQADAMKEAGINVDYVLEFDVPDELIVDRIVGRRVHAPS'+
'GRVYHVKFNPPKVEGKDDVTGEELTTRKDDQEETVRKRLVEYH'+
'QMTAPLIGYYYYSKEAEAGNTKYAKVDGTKPVAEVRADLEKILG')
# set up the search environment
# $EBROOTHHMINSUITE points to the root of a HHsuite installation
hh = hhblits.HHblits(query_seq, os.getenv('EBROOTHHMINSUITE'))
# now create a search profile for the query sequence against the NR20 db
# provided on the HHblits web page, nr20_12Aug11 is just the prefix common to
# all db files, so `ls hhtools/nr20_12Aug11*` would list all of them
a3m_file = hh.BuildQueryMSA(nrdb='hhtools/nr20_12Aug11')
# search time! we just search against NR20 again, but every HHblits db is
# working here, e.g. one build from all the sequences in PDB
hit_file = hh.Search(a3m_file, 'hhtools/nr20_12Aug11')
# lets have a look at the resuls
with open(hit_file) as hit_fh:
header, hits = hhblits.ParseHHblitsOutput(hit_fh)
for hit in hits:
print(hit.aln)
# cleanup
hh.Cleanup()
Very similar going by file:
from ost.bindings import hhblits
# set up the search environment
# $EBROOTHHMINSUITE points to the root of a HHsuite installation
hh = hhblits.HHblits('query.fas', os.getenv('EBROOTHHMINSUITE'))
# now create a search profile for the query sequence against the NR20 db
# provided on the HHblits web page, nr20_12Aug11 is just the prefix common to
# all db files, so `ls hhtools/nr20_12Aug11*` would list all of them
a3m_file = hh.BuildQueryMSA(nrdb='hhtools/nr20_12Aug11')
# search time! we just search against NR20 again, but every HHblits db is
# working here, e.g. one build from all the sequences in PDB
hit_file = hh.Search(a3m_file, 'hhtools/nr20_12Aug11')
# lets have a look at the resuls
with open(hit_file) as hit_fh:
header, hits = hhblits.ParseHHblitsOutput(hit_fh)
for hit in hits:
print(hit.aln)
# cleanup
hh.Cleanup()
The alignments produced by HHblits are sometimes slightly better than by BLAST, so one may want to extract them:
from ost.bindings import hhblits
# set up the search environment
# $EBROOTHHMINSUITE points to the root of a HHsuite installation
hh = hhblits.HHblits('query.fas', os.getenv('EBROOTHHMINSUITE'))
# now create a search profile for the query sequence against the NR20 db
# provided on the HHblits web page, nr20_12Aug11 is just the prefix common to
# all db files, so `ls hhtools/nr20_12Aug11*` would list all of them
a3m_file = hh.BuildQueryMSA(nrdb='hhtools/nr20_12Aug11')
# note that ParseA3M is not a class method but a module function
output = hhblits.ParseA3M(open(a3m_file))
print(output['msa'])
# cleanup
hh.Cleanup()
Binding API¶
HHblits wrapper classes and functions.
-
class
HHblits
(query, hhsuite_root, hhblits_bin=None, working_dir=None)¶ Initialise a new HHblits “search” for the given query. Query may either be a
SequenceHandle
or a string. In the former case, the query is the actual query sequence, in the latter case, the query is the filename to the file containing the query.Parameters: - query (
SequenceHandle
orstr
) – Query sequence as file or sequence. - hhsuite_root (
str
) – Path to the top-level directory of your hhsuite installation. - hhblits_bin (
str
) – Name of the hhblits binary. Will only be used ifhhsuite_root
/bin/hhblits
does not exist. - working_dir (
str
) – Directory for temporary files. Will be created if not present but not automatically deleted.
-
A3MToCS
(a3m_file, cs_file=None, options={})¶ Converts the A3M alignment file to a column state sequence file. If cs_file is not given, the output file will be set to <
a3m_file
-basename>.seq219.If the file was already produced, the existing file path is returned without recomputing it.
Parameters: - a3m_file (
str
) – Path to input MSA as produced byBuildQueryMSA()
- cs_file (
str
) – Output file name (may be omitted) - options (
dict
) – Dictionary of options to cstranslate, one “-” is added in front of every key. Boolean True values add flag without value.
Returns: Path to the column state sequence file
Return type: str
- a3m_file (
-
A3MToProfile
(a3m_file, hhm_file=None)¶ Converts the A3M alignment file to a hhm profile. If hhm_file is not given, the output file will be set to <
a3m_file
-basename>.hhm.The produced HHM file can be parsed by
ParseHHM()
.If the file was already produced, the existing file path is returned without recomputing it.
Parameters: - a3m_file (
str
) – Path to input MSA as produced byBuildQueryMSA()
- hhm_file (
str
) – Desired output file name
Returns: Path to the profile file
Return type: str
- a3m_file (
-
BuildQueryMSA
(nrdb, options={}, a3m_file=None)¶ Builds the MSA for the query sequence.
This function directly uses hhblits of hhtools. While in theory it would be possible to do this by PSI-blasting on our own, hhblits is supposed to be faster. Also it is supposed to prevent alignment corruption. The alignment corruption is caused by low-scoring terminal alignments that draw the sequences found by PSI-blast away from the optimum. By removing these low scoring ends, part of the alignment corruption can be suppressed.
hhblits does not call PSIPRED on the MSA to predict the secondary structure of the query sequence. This is done by addss.pl of hhtools. The predicted secondary structure is stored together with the sequences identified by hhblits.
The produced A3M file can be parsed by
ParseA3M()
. If the file was already produced, hhblits is not called again and the existing file path is returned.Parameters: - nrdb (
str
) – Database to be align against; has to be an hhblits database - options (
dict
) – Dictionary of options to hhblits, one “-” is added in front of every key. Boolean True values add flag without value. Merged with default options {‘cpu’: 1, ‘n’: 1}, where ‘n’ defines the number of iterations. - a3m_file (
str
) – a path of a3m_file to be used, optional
Returns: The path to the A3M file containing the MSA
Return type: str
- nrdb (
-
Cleanup
()¶ Delete temporary data.
Delete temporary data if no working dir was given. Controlled by
needs_cleanup
.
-
CleanupFailed
()¶ In case something went wrong, call to make sure everything is clean.
This will delete the working dir independently of
needs_cleanup
.
-
Search
(a3m_file, database, options={}, prefix='')¶ Searches for templates in the given database. Before running the search, the hhm file is copied. This makes it possible to launch several hhblits instances at once. Upon success, the filename of the result file is returned. This file may be parsed with
ParseHHblitsOutput()
.Parameters: - a3m_file (
str
) – Path to input MSA as produced byBuildQueryMSA()
- database (
str
) – Search database, needs to be the common prefix of the database files - options (
dict
) – Dictionary of options to hhblits, one “-” is added in front of every key. Boolean True values add flag without value. Merged with default options {‘cpu’: 1, ‘n’: 1}, where ‘n’ defines the number of iterations. - prefix (
str
) – Prefix to the result file
Returns: The path to the result file
Return type: str
- a3m_file (
- query (
-
class
HHblitsHit
(hit_id, aln, score, ss_score, evalue, pvalue, prob)¶ A hit found by HHblits
-
hit_id
¶ String identifying the hit
Type: str
-
aln
¶ Pairwise alignment containing the aligned part between the query and the target. First sequence is the query, the second sequence the target.
Type: AlignmentHandle
-
score
¶ The alignment score
Type: float
-
ss_score
¶ The secondary structure score
Type: float
-
evalue
¶ The E-value of the alignment
Type: float
-
pvalue
¶ The P-value of the alignment
Type: float
-
prob
¶ The probability of the alignment (between 0 and 100)
Type: float
-
-
class
HHblitsHeader
¶ Stats from the beginning of search output.
-
query
¶ The name of the query sequence
Type: str
-
match_columns
¶ Total of aligned Match columns
Type: int
-
n_eff
¶ Value of the
-neff
optionType: float
-
searched_hmms
¶ Number of profiles searched
Type: int
-
date
¶ Execution date
Type: datetime.datetime
-
command
¶ Command used to run
Type: str
-
-
ParseHHblitsOutput
(output)¶ Parses the HHblits output as produced by
HHblits.Search()
and returns the header of the search results and a list of hits.Parameters: output (iterable (e.g. an open file handle)) – Iterable containing the lines of the HHblits output file Returns: a tuple of the header of the search results and the hits Return type: ( HHblitsHeader
,list
ofHHblitsHit
)
-
ParseA3M
(a3m_file)¶ Parse secondary structure information and the multiple sequence alignment out of an A3M file as produced by
HHblits.BuildQueryMSA()
.Parameters: a3m_file (iterable (e.g. an open file handle)) – Iterable containing the lines of the A3M file Returns: Dictionary containing “ss_pred” ( list
), “ss_conf” (list
) and “msa” (AlignmentHandle
).
-
ParseHHM
(profile)¶ Parse secondary structure information and the MSA out of an HHM profile as produced by
HHblits.A3MToProfile()
.Parameters: profile ( file
) – Opened file handle holding the profile.Returns: Dictionary containing “ss_pred” ( list
), “ss_conf” (list
), “msa” (AlignmentHandle
) and “consensus” (SequenceHandle
).
-
ParseHeaderLine
(line)¶ Fetch header content.
First, we seek the start of the identifier, that is, the first whitespace after the hit number + 1. Since the identifier may contain whitespaces itself, we cannot split the whole line
Parameters: line ( str
) – Line from the output header.Returns: Hit information and query/template offsets Return type: ( HHblitsHit
, (int
,int
))