hhblits
- Search related sequences in databases¶
Introduction¶
HHblits is a sequence search tool like BLAST, able to find more distant homologs This is achieved by performing profile-profile searches. In BLAST, a query sequence is searched against a sequence database. That is a sequence-sequence search. HHblits works on a profile database, usually that one is provided, queried with a sequence profile. The latter one needs to be calculated before the actual search. In very simple words, HHblits is using per-sequence scoring functions to be more sensitive, in this particular case Hidden Markov models. The software suite needed for HHblits can be found here.
Examples¶
A typical search: Get an instance of the bindings, build the search profile out
of the query sequence, run the search and iterate results. Since
HHblits
works with a SequenceHandle
or a FastA
file, both ways are shown.
First query by sequence:
from ost.bindings import hhblits
# get a SequenceHandle
query_seq = seq.CreateSequence('Query',
'MRIILLGAPGAGKGTQAQFIMEKYGIPQISTGDMLRAAVKSGS'+
'ELGKQAKDIMDAGKLVTDELVIALVKERIAQEDCRNGFLLDGF'+
'PRTIPQADAMKEAGINVDYVLEFDVPDELIVDRIVGRRVHAPS'+
'GRVYHVKFNPPKVEGKDDVTGEELTTRKDDQEETVRKRLVEYH'+
'QMTAPLIGYYYYSKEAEAGNTKYAKVDGTKPVAEVRADLEKILG')
# set up the search environment
# $EBROOTHHMINSUITE points to the root of a HHsuite installation
hh = hhblits.HHblits(query_seq, os.getenv('EBROOTHHMINSUITE'))
# now create a search profile for the query sequence against the NR20 db
# provided on the HHblits web page, nr20_12Aug11 is just the prefix common to
# all db files, so `ls hhtools/nr20_12Aug11*` would list all of them
a3m_file = hh.BuildQueryMSA(nrdb='hhtools/nr20_12Aug11')
# search time! we just search against NR20 again, but every HHblits db is
# working here, e.g. one build from all the sequences in PDB
hit_file = hh.Search(a3m_file, 'hhtools/nr20_12Aug11')
# lets have a look at the resuls
with open(hit_file) as hit_fh:
header, hits = hhblits.ParseHHblitsOutput(hit_fh)
for hit in hits:
print hit.aln
Very similar going by file:
from ost.bindings import hhblits
# set up the search environment
# $EBROOTHHMINSUITE points to the root of a HHsuite installation
hh = hhblits.HHblits('query.fas', os.getenv('EBROOTHHMINSUITE'))
# now create a search profile for the query sequence against the NR20 db
# provided on the HHblits web page, nr20_12Aug11 is just the prefix common to
# all db files, so `ls hhtools/nr20_12Aug11*` would list all of them
a3m_file = hh.BuildQueryMSA(nrdb='hhtools/nr20_12Aug11')
# search time! we just search against NR20 again, but every HHblits db is
# working here, e.g. one build from all the sequences in PDB
hit_file = hh.Search(a3m_file, 'hhtools/nr20_12Aug11')
# lets have a look at the resuls
with open(hit_file) as hit_fh:
header, hits = hhblits.ParseHHblitsOutput(hit_fh)
for hit in hits:
print hit.aln
The alignments produced by HHblits are sometimes slightly better than by BLAST, so one may want to extract them:
from ost.bindings import hhblits
# set up the search environment
# $EBROOTHHMINSUITE points to the root of a HHsuite installation
hh = hhblits.HHblits('query.fas', os.getenv('EBROOTHHMINSUITE'))
# now create a search profile for the query sequence against the NR20 db
# provided on the HHblits web page, nr20_12Aug11 is just the prefix common to
# all db files, so `ls hhtools/nr20_12Aug11*` would list all of them
a3m_file = hh.BuildQueryMSA(nrdb='hhtools/nr20_12Aug11')
# note that ParseA3M is not a class method but a module function
output = hhblits.ParseA3M(open(a3m_file))
print output['msa']
Binding API¶
-
class
HHblits
(query, hhsuite_root, hhblits_bin=None, working_dir=None)¶ Initialise a new HHblits “search” for the given query. Query may either be a
SequenceHandle
or a string. In the former case, the query is the actual query sequence, in the latter case, the query is the filename to the file containing the query.Parameters: - query (
SequenceHandle
orstr
) – Query sequence as file or sequence. - hhsuite_root (
str
) – Path to the top-level directory of your hhsuite installation. - hhblits_bin (
str
) – Name of the hhblits binary. Will only be used ifhhsuite_root
/bin/hhblits
does not exist. - working_dir (
str
) – Directory for temporary files. Will be created if not present but not automatically deleted.
-
A3MToCS
(a3m_file, cs_file=None, options={})¶ Converts the A3M alignment file to a column state sequence file. If cs_file is not given, the output file will be set to <
a3m_file
-basename>.seq219.Parameters: - a3m_file (
str
) – A3M file to be converted - cs_file (
str
) – output file name (may be omitted) - options (
dict
) – dictionary of options to cstranslate, must come with the right amount of ‘-‘ in front.
Returns: the path to the column state sequence file
Return type: str
- a3m_file (
-
A3MToProfile
(a3m_file, hhm_file=None)¶ Converts the A3M alignment file to a hhm profile. If hhm_file is not given, the output file will be set to <
a3m_file
-basename>.hhm.Parameters: - a3m_file (
str
) – input MSA - hhm_file (
str
) – output file name
Returns: the path to the profile
Return type: str
- a3m_file (
-
BuildQueryMSA
(nrdb, iterations=1, mact=None, cpu=1)¶ Builds the MSA for the query sequence
This function directly uses hhblits of hhtools. While in theory it would be possible to do this by PSI-blasting on our own, hhblits is supposed to be faster. Also it is supposed to prevent alignment corruption. The alignment corruption is caused by low-scoring terminal alignments that draw the sequences found by PSI-blast away from the optimum. By removing these low scoring ends, part of the alignment corruption can be suppressed. hhblits does not call PSIPRED on the MSA to predict the secondary structure of the query sequence. This is done by addss.pl of hhtools. The predicted secondary structure is stored together with the sequences identified by hhblits.
Parameters: - nrdb (
str
) – Database to be align against; has to be an hhblits database - iterations (
int
) – Number of hhblits iterations - mact (
float
) –-mact
of hhblits - cpu (
int
) –-cpu
of hhblits
Returns: the path to the MSA file
Return type: str
- nrdb (
-
Cleanup
()¶ Delete temporary data.
Delete temporary data if no working dir was given. Controlled by
needs_cleanup
.
-
CleanupFailed
()¶ In case something went wrong, call to make sure everything is clean.
-
Search
(a3m_file, database, options={}, prefix='')¶ Searches for templates in the given database. Before running the search, the hhm file is copied. This makes it possible to launch several hhblits instances at once. Upon success, the filename of the result file is returned. This file may be parsed with
ParseHHblitsOutput()
.Parameters: - a3m_file (
str
) – input MSA file - database (
str
) – search database, needs to be the common prefix of the database files - options (
dict
) – dictionary of options, must come with the right amount of ‘-‘ in front. - prefix (
str
) – prefix to the result file
Returns: the path to the result file
Return type: str
- a3m_file (
- query (
-
class
HHblitsHit
(hit_id, aln, score, ss_score, evalue, pvalue, prob)¶ A hit found by HHblits
-
hit_id
¶ String identifying the hit
Type: str
-
aln
¶ Pairwise alignment containing the aligned part between the query and the target. First sequence is the query, the second sequence the target.
Type: AlignmentHandle
-
score
¶ The alignment score
Type: float
-
ss_score
¶ The secondary structure score
Type: float
-
evalue
¶ The E-value of the alignment
Type: float
-
pvalue
¶ The P-value of the alignment
Type: float
-
prob
¶ The probability of the alignment (between 0 and 100)
Type: float
-
-
class
HHblitsHeader
¶ Stats from the beginning of search output.
-
query
¶ The name of the query sequence
Type: str
-
match_columns
¶ Total of aligned Match columns
Type: int
-
n_eff
¶ Value of the
-neff
optionType: float
-
searched_hmms
¶ Number of profiles searched
Type: int
-
date
¶ Execution date
Type: datetime.datetime
-
command
¶ Command used to run
Type: str
-
-
ParseHHblitsOutput
(output)¶ Parses the HHblits output and returns a tuple of
HHblitsHeader
and a list ofHHblitsHit
instances.Parameters: output ( file
/ iteratable) – output of aHHblits.Search()
, needs to be iteratable, e.g. an open file handleReturns: a tuple of the header of the search results and the hits Return type: ( HHblitsHeader
,HHblitsHit
)
-
ParseA3M
(a3m_file)¶ Parse secondary structure information and the multiple sequence alignment out of an A3M file.
Parameters: a3m_file (iteratable, e.g. an opened file) – Iteratable containing the lines of the A3M file Returns: Dictionary containing “ss_pred” ( list
), “ss_conf” (list
) and “msa” (AlignmentHandle
).
-
ParseHeaderLine
(line)¶ Fetch header content.
First, we seek the start of the identifier, that is, the first whitespace after the hit number + 1. Since the identifier may contain whitespaces itself, we cannot split the whole line
Parameters: line ( str
) – Line from the output header.Returns: Hit information Return type: HHblitsHit
-
ParseHHM
(profile)¶ Parse secondary structure information and the MSA out of an HHM profile.
Parameters: profile ( file
) – Opened file handle holding the profile.Returns: Dictionary containing “ss_pred” ( list
), “ss_conf” (list
), “msa” (AlignmentHandle
) and “consensus” (~ost.seq.SequenceHandle).
-
EstimateMemConsumption
()¶ Estimate the memory needed by HHblits. By default it uses not more than 3G. Also for small sequences it already uses quite some memnmory (46AA, 1.48G). And since the memory consumption could depend on the iterative search runs, how many hits are found in each step, we just go with 4G, here.
Returns: Assumed memory consumtion Return type: ( float
,str
)