You are reading the documentation for version 2.3.1 of OpenStructure. You may also want to read the documentation for: 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.7.1 1.8 1.9 1.10 1.11 2.0 2.1 2.2 2.3 devel

# seq – Sequences and Alignments¶

The seq module helps you working with sequence data of various kinds. It has classes for single sequences, lists of sequences and alignments of two or more sequences.

## Attaching Structures to Sequences¶

As OpenStructure is a computational structural biology framework, it is not surprising that the sequence classes have been designed to work together with structural data. Each sequence can have an attached EntityView allowing for fast mapping between residues in the entity view and position in the sequence.

### Sequence Offset¶

When using sequences and structures together, often the start of the structure and the beginning of the sequence do not fall together. In the following case, the alignment of sequences B and C only covers a subsequence of structure A:

A acefghiklmnpqrstuvwy
B     ghiklm
C     123-45


We would now like to know which residue in protein A is aligned to which residue in sequence C. This is achieved by setting the sequence offset of sequence C to 4. In essence, the sequence offset influences all the mapping operations from position in the sequence to residue index and vice versa. By default, the sequence offset is 0.

The io module supports input and output of common sequence formats. Single sequences can be loaded from disk with LoadSequence(), alignments are loaded with LoadAlignment() and lists of sequences are loaded with LoadSequenceList(). In addition to the file based input methods, sequences can also be loaded from a string:

seq_string = '''>sequence
abcdefghiklmnop'''
s = io.SequenceFromString(seq_string, 'fasta')
print(s.name, s) # will print "sequence abcdefghiklmnop"


Note that, in that case specifying the format is mandatory.

## The SequenceHandle¶

CreateSequence(name, sequence, role="UNKNOWN")

Create a new SequenceHandle with the given name and sequence.

Parameters: name (str) – Name of the sequence sequence (str) – String of characters representing the sequence. Only ‘word’ characters (no digits), ‘?’, ‘-‘ and ‘.’ are allowed. In an upcoming release, ‘?’ and ‘.’ will also be forbidden so its best to translate those to ‘X’ or ‘-‘. role (str) – Role of the sequence (optional) InvalidSequence – When the sequence string contains forbidden characters. In the future, ‘?’ and ‘.’ will also raise this exception.
class SequenceHandle
class ConstSequenceHandle

Represents a sequence. New instances are created with CreateSequence().

GetPos(residue_index)

Get position of residue with index in sequence. This is best illustrated in the following example:

s=seq.CreateSequence("A", "abc---def")
print(s.GetPos(1)) # prints 1
print(s.GetPos(3)) # prints 6


The reverse mapping, that is from position in the sequence to residue index can be achieved with GetResidueIndex().

GetResidueIndex(pos)

Get residue index of character at given position. This method is the inverse of GetPos(). If the sequence contains a gap at that position, an Error is raised. Admires the sequence offset.

s=seq.CreateSequence("A", "abc--def")
print(s.GetResidueIndex(1)) # prints 1
print(s.GetResidueIndex(6)) # prints 4
# the following line raises an exception of type
# Error with the message "requested position contains
# a gap"
print(s.GetResidueIndex(3))

GetResidue(pos)

As, GetResidueIndex(), but directly returns the residue view. If no view is attached, or if the position is a gap, an invalid residue view is returned.

Return type: ResidueView
GetLastNonGap()

Get position of last non-gap character in sequence. In case of an empty sequence, or, a sequence only consisting of hyphens, -1 is returned

GetFirstNonGap()

Get position of first non-gap character in sequence. In case of an empty sequence, or, a sequence only consisting of hyphens, -1 is returned.

AttachView(view)
AttachView(view, chain_name)

Attach an EntityView to sequence. The first signature requires that the view contains one chain. If not, an IntegrityError is raised. The second signature will select the chain with the given name. If no such chain exists, an IntegrityError is raised.

HasAttachedView()

Returns True when the sequence has a view attached, False if not.

GetAttachedView()

Returns the attached EntityView, or an invalid EntityView if no view has been attached. Also available as the property attached_view.

GetName()

Returns the name of the sequence. Also available as the property name

SetOffset()

Set the sequence offset. By default, the offset is 0. Also available as the property offset.

GetOffset()

Returns the sequence offset. Also available as offset.

GetGaplessString()

Returns a string version of this sequence with all hyphens removed. Also available as the property gapless_string.

Normalise()

Remove ‘-‘ and ‘.’ as gaps from the sequence and make it all upper case. Works in place.

SetName()

Set name of the sequence. Also available as the property name.

GetOneLetterCode(pos)
__getitem__(pos)
__getitem__(slice)
Returns: Character at position pos of sequence (also supports pythonic slicing with [] operator) str
gapless_string

Shorthand for GetGaplessString()

name

Shorthand for GetName()/SetName()

attached_view

Shorthand for GetAttachedView().

offset

Shorthand for GetOffset()/SetOffset()

role

Role of this sequence.

Type: str
__len__()
Returns: The length of the sequence (including insertions and deletions)
__str__()
Returns: The sequence as a string.
Copy()

Create a deep copy of the sequence. The newly created sequence has the same attached view (not a deep copy of the view!).

SequenceFromChain(name, chain)
Returns: Sequence extracted from one letter codes in given chain with a view to the chain attached to it SequenceHandle name (str) – Name of the sequence chain (ChainHandle / ChainView) – Chain from which to extract sequence
Match(s1, s2)
Parameters: s1 (SequenceHandle, or str) – The first sequence s2 (SequenceHandle, or str) – The second sequence

Check whether the two sequences s1 and s2 match. This function performs are case-insensitive comparison of the two sequences. The character ‘X’ is interpreted as a wild card character that always matches the other sequence.

## The SequenceList¶

CreateSequenceList()

Creates and returns a new SequenceList with no sequences.

class SequenceList
class ConstSequenceList

Represents a list of sequences. The class provides a row-based interface.

GetCount()
__len__()
Returns: Number of sequences in the list. int
AddSequence(sequence)

Append a sequence to the list.

GetMinLength()
GetMaxLength()
Returns: Minimal / maximal length of the sequences in this list. int
FindSequence(name)

Find sequence with given name. If the alignment contains several sequences with the same name, the first sequence is returned.

SequencesHaveEqualLength()
Returns: True if all sequences have same length.
Take(n)
Returns: First n (or last -n if n negative) sequences.
Slice(first, n)
Returns: n sequences starting from first.
__getitem__(key)
Returns: Sequence(s) indexed by key (supports pythonic slicing).

## The AlignmentHandle¶

The AlignmentHandle represents a list of aligned sequences. In contrast to SequenceList, an alignment requires all sequences to be of the same length. New instances of alignments are created with CreateAlignment() and AlignmentFromSequenceList().

Typically sequence alignments are used column-based, i.e by looking at an aligned columns in the sequence alignment. To get a row-based (sequence) view on the sequence list, use GetSequences().

All functions that operate on an alignment will again produce a valid alignment. This mean that it is not possible to change the length of one sequence, without adjusting the other sequences, too.

The following example shows how to iterate over the columns and sequences of an alignment:

aln=io.LoadAlignment('aln.fasta')
# iterate over the columns
for col in aln:
print(col)

# iterate over the sequences
for s in aln.sequences:
print(s)

CreateAlignment()

Creates and returns a new AlignmentHandle with no sequences.

AlignmentFromSequenceList(sequences)

Create a new alignment from the given list of sequences

Parameters: sequences (ConstSequenceList) – the list of sequences InvalidAlignment if the sequences do not have the same length.
class AlignmentHandle
GetSequence(index)
Returns: Sequence at the given index, raising an IndexError when trying to access an inexistent sequence. ConstSequenceHandle
GetSequences()
Returns: List of all sequence of the alignment. Also available as sequences. ConstSequenceList
GetLength()
__len__()
Returns: Length of the alignment. int
GetCount()
Returns: Number of sequences in the alignment. Also available as sequence_count. int
ToString(width=80)
Returns: Formatted string version of the alignment. The sequences are split into smaller parts to fit into the number columns specified. str
aln=seq.CreateAlignment()
# The following command will print the output given below
print(aln.ToString(7))
# A abcde
# B 12345
#
# A fghik
# B 67890

FindSequence(name)
Returns: Sequence with given name. If the alignment contains several sequences with the same name, the first sequence is returned.
SetSequenceName(seq_index, name)

Set the name of the sequence at index seq_index to name (see SequenceHandle.name).

Copy()

Create a deep copy of the alignment by copying each contained sequence (see SequenceHandle.Copy())

GetPos(seq_index, res_index)
Returns: Position of residue with index equal to res_index in sequence at index seq_index (see SequenceHandle.GetPos())
GetResidueIndex(seq_index, pos)
Returns: Residue index of residue at position pos in sequence at index seq_index (see SequenceHandle.GetResidueIndex())
GetResidue(seq_index, pos)
Returns: Attached residue at position pos in sequence at index seq_index (see SequenceHandle.GetResidue()).
AttachView(seq_index, view)
AttachView(seq_index, view, chain_name)

Attach the given view to the sequence at index seq_index (see SequenceHandle.AttachView()).

Cut(start, end)

Removes the columns in the half-closed interval start, end from the alignment. Note that this function does not update offsets!

aln=seq.CreateAlignment()
aln.Cut(4, 7)

print(aln.ToString(80))
# will print
# A abcdhik
# B 1234890

Replace(new_region, start, end)

Replace the columns in the half-closed interval start, end with the columns in new_region.

Parameters: new_region (AlignedRegion or AlignmentHandle) – The region to be inserted
GetMatchingBackboneViews(index1=0, index2=1)

Returns a tuple of entity views containing matching backbone atoms for the two sequences at index1 and index2, respectively. For each aligned column in the alignment, backbone atoms are added to the view if both aligned residues have them. It is guaranteed that the two views contain the same number of atoms and that the order of the atoms in the two views is the same.

The output of this function can be used to superpose two structures with SuperposeSVD().

Parameters: index1 – The index of the first sequence index2 – The index of the second sequence. In case one of the two sequences doesn’t have an attached view, a RuntimeError is raised.
AddSequence(sequence)

Append a sequence to the alignment. The sequence must have the same length as sequences already present in the alignment.

Raises: RuntimeError if the sequence length does not match sequence (ConstSequenceHandle) – Sequence to be added
GetSequenceOffset(index)
SetSequenceOffset(index, offset)

Get/set the offset for sequence at index (see SequenceHandle.offset).

Parameters: index (int) – The index of the sequence offset (int) – The new offset int
GetSequenceRole(index)
SetSequenceRole(index, role)

Get/Set the sequence role for sequence at index (see SequenceHandle.role).

Parameters: index (int) – The index of the sequence role (str) – The new role str
GetCoverage(index)

Get coverage of sequence at index to the first sequence.

Parameters: index (int) – The index of the sequence Coverage as a number between 0 and 1.
RemoveSequence(index)

Remove sequence at index from the alignment.

sequences

Shorthand for GetSequences()

sequence_count

Shorthand for GetCount()

__getitem__(pos)
Returns: Column at position pos of alignment. AlignedColumn
__getitem__(slice)
Returns: Columns defined by by pythonic slicing. AlignedRegion
class AlignedRegion

Represents a slice of an AlignmentHandle.

GetAlignmentHandle()
Returns: Alignment from which we slices. AlignmentHandle
GetLength()
__len__()
Returns: Number of columns in the slice.
__getitem__(pos)
Returns: Column at position pos within this slice. AlignedColumn
start

Starting position in alignment.

end

One after end position in alignment.

class AlignedColumn
GetIndex()
Returns: Position in alignment.
GetRowCount()
Returns: Number of rows in the column.
GetResidue(row)
Returns: Attached residue for sequence at given row of this column (see AlignmentHandle.GetResidue()).
__getitem__(row)
Returns: Character at given row of this column. str
__str__()
Returns: String representation of column in alignment.

## Extracting views from sequences¶

ViewsFromSequences(seq1, seq2)

Returns a tuple of entity views containing only the atoms of the aligned residues. The order of residues in the two views is guaranteed to be the same but the order of atoms within each residue may differ. If the order of atoms is crucial (e.g. for SuperposeSVD()) either prefilter the attached views to include only one atom per residue or use the slower (approx. 50% more runtime) AlignmentHandle.GetMatchingBackboneViews().

Returns: Pair of views including all the aligned residues of the two given sequences. An alignment is tuple with two EntityView Exception if sequence lengths do not match or if any of the sequences is lacking an attached view.
ViewsFromAlignment(aln, index1=0, index2=1)
Returns: Pair of views as in ViewsFromSequences(). tuple with two EntityView aln (AlignmentHandle) – Alignment from which to extract sequences. index1 (int) – Index of first sequence in aln to use. index2 (int) – Index of second sequence in aln to use.

## Handling Sequence Profiles¶

The ProfileHandle provides a simple container for profiles for each residue. It mainly contains:

Optionally, HMM-related information can be added. This is transition probabilities between Match, Insertion or Deletion states or neff values (number of effective sequences, a measure of local sequence diversity).

class HMMTransition

The possible HMM-transitions between Match(M), Insertion(I) and Deletion(D) states. Transitions between Deletion and Insertion are disallowed:

HMM_M2M, HMM_M2I, HMM_M2D, HMM_I2M, HMM_I2I, HMM_D2M, HMM_D2D

class HMMData

Data container for HMM-related information that can be assigned to profile columns.

neff

Local diversity of all sequences that have a residue at this column of the full alignment.

neff_i

Local diversity of all sequences that have an insertion at this column of the full alignment.

neff_d

Same for deletion.

GetProb(transition)

Get HMM transition probability

Parameters: transition (HMMTransition) – The transition
SetProb(transition, prob)

Set HMM transition probability

Parameters: transition (HMMTransition) – The transition prob (float) – The probablity to be set
class ProfileColumn
BLOSUMNullModel()

Static method, that returns a new ProfileColumn with amino acid frequencies given from the BLOSUM62 substitution matrix.

HHblitsNullModel()

Static method, that returns a new ProfileColumn with amino acid frequencies as set in HHblits output.

GetFreq(aa)
Returns: Frequency of aa float aa (str) – One letter code of standard amino acid
SetFreq(aa, freq)
Parameters: aa (str) – One letter code of standard amino acid freq (float) – The frequency of the given amino acid
GetScore(other, null_model)
Returns: Column score as in Soeding-2005 paper. float other (ProfileColumn) – Other column to compute score with. null_model (ProfileColumn) – Null model to use for weighting.
SetHMMData(data)
Parameters: data (HMMData) – Data to be set
GetHMMData()

Returns previously set HMMData object.

Return type: HMMData Error if data has never been set.
entropy

Shannon entropy based on the columns amino acid frequencies

Type: float
class ProfileHandle
__len__()

Returns the length of the sequence for which we have profile.

Return type: int
AddColumn(col, olc='X')

Appends column in the internal column list.

Parameters: col (str) – Column to add to columns olc – One letter code to add to sequence
Extract(from, to)
Parameters: from (int) – Col Idx to start from to (int) – End Idx, not included in sub-ProfileHandle sub-profile as defined by given indices (null_model and neff are copied, you might want to manually reset neff) ProfileHandle Error if to <= from or to > __len__().
GetAverageScore(other, offset=0)
Returns: Average column score between other.columns[i] and this object’s columns[i+offset] for i in [0, len(other)-1] using this object’s null_model. See ProfileColumn.GetScore(). float other (ProfileHandle) – Other profile to compare with. offset (int) – Start comparison at column offset of this object. Error if any columns[i+offset] out of bounds.
sequence

Sequence for which we have this profile. When setting a new value, the length and the number of profile columns must match (exception thrown otherwise).

Type: str
columns

Iterable columns of the profile (read-only).

Type: ProfileColumnList
null_model

Null model of the profile. By default this is set to ProfileColumn.HHblitsNullModel().

avg_entropy

Average entropy of all the columns (read-only).

Type: float
neff

Measure for sequence diversity which is defined as the average of the per-column neff values. However, this is just a convenience attribute which can be set to arbitrary values but there is no guarantee that it’s the actual average of the per-column values.

class ProfileDB

A simple database to gather ProfileHandle objects. It is possible to save them to disk in a compressed format with limited accuracy (4 digits for each frequency).

Save(filename)
Parameters: filename (str) – Name of file that will be generated on disk.
Load(filename)

Parameters: filename (str) – Name of file from which the database should be loaded. The loaded database
AddProfile(name, prof)
Parameters: name (str) – Name of profile to be added prof (ProfileHandle) – Profile to be added Exception when filename is longer than 255 characters.
GetProfile(name)
Parameters: name (str) – Name of profile to be returned The requested ProfileHandle Exception when no ProfileHandle for name exists.
Size()
Returns: Number of ProfileHandle objects in the database
GetNames()
Returns: A nonsorted list of the names of all ProfileHandle objects in the database

## Search

Enter search terms or a module, class or function name.

## Previous topic

img.alg - Image Processing Algorithms

## Next topic

seq.alg – Algorithms for Sequences