seq – Sequences and Alignments

The seq module helps you working with sequence data of various kinds. It has classes for single sequences, lists of sequences and alignments of two or more sequences.

Attaching Structures to Sequences

As OpenStructure is a computational structural biology framework, it is not surprising that the sequence classes have been designed to work together with structural data. Each sequence can have an attached EntityView allowing for fast mapping between residues in the entity view and position in the sequence.

Sequence Offset

When using sequences and structures together, often the start of the structure and the beginning of the sequence do not fall together. In the following case, the alignment of sequences B and C only covers a subsequence of structure A:

A acefghiklmnpqrstuvwy
B     ghiklm
C     123-45

We would now like to know which residue in protein A is aligned to which residue in sequence C. This is achieved by setting the sequence offset of sequence C to 4. In essence, the sequence offset influences all the mapping operations from position in the sequence to residue index and vice versa. By default, the sequence offset is 0.

Loading and Saving Sequences and Alignments

The io module supports input and output of common sequence formats. Single sequences can be loaded from disk with LoadSequence(), alignments are loaded with LoadAlignment() and lists of sequences are loaded with LoadSequenceList(). In addition to the file based input methods, sequences can also be loaded from a string:

seq_string = '''>sequence
abcdefghiklmnop'''
s = io.SequenceFromString(seq_string, 'fasta')
print(s.name, s) # will print "sequence abcdefghiklmnop"

Note that, in that case specifying the format is mandatory.

The SequenceHandle

CreateSequence(name, sequence, role='UNKNOWN')

Create a new SequenceHandle with the given name and sequence.

Parameters:
  • name (str) – Name of the sequence

  • sequence (str) – String of characters representing the sequence. Only ‘word’ characters (no digits), ‘?’, ‘-’ and ‘.’ are allowed. In an upcoming release, ‘?’ and ‘.’ will also be forbidden so its best to translate those to ‘X’ or ‘-‘.

  • role (str) – Role of the sequence (optional)

Raises:

InvalidSequence – When the sequence string contains forbidden characters. In the future, ‘?’ and ‘.’ will also raise this exception.

class SequenceHandle
class ConstSequenceHandle

Represents a sequence. New instances are created with CreateSequence().

GetPos(residue_index)

Get position of residue with index in sequence. This is best illustrated in the following example:

s=seq.CreateSequence("A", "abc---def")
print(s.GetPos(1)) # prints 1
print(s.GetPos(3)) # prints 6

The reverse mapping, that is from position in the sequence to residue index can be achieved with GetResidueIndex().

GetResidueIndex(pos)

Get residue index of character at given position. This method is the inverse of GetPos(). If the sequence contains a gap at that position, an Error is raised. Admires the sequence offset.

s=seq.CreateSequence("A", "abc--def")
print(s.GetResidueIndex(1)) # prints 1
print(s.GetResidueIndex(6)) # prints 4
# the following line raises an exception of type
# Error with the message "requested position contains
# a gap"
print(s.GetResidueIndex(3))
GetResidue(pos)

As, GetResidueIndex(), but directly returns the residue view. If no view is attached, or if the position is a gap, an invalid residue view is returned.

Return type:

ResidueView

GetLastNonGap()

Get position of last non-gap character in sequence. In case of an empty sequence, or, a sequence only consisting of hyphens, -1 is returned

GetFirstNonGap()

Get position of first non-gap character in sequence. In case of an empty sequence, or, a sequence only consisting of hyphens, -1 is returned.

AttachView(view)
AttachView(view, chain_name)

Attach an EntityView to sequence. The first signature requires that the view contains one chain. If not, an IntegrityError is raised. The second signature will select the chain with the given name. If no such chain exists, an IntegrityError is raised.

HasAttachedView()

Returns True when the sequence has a view attached, False if not.

GetAttachedView()

Returns the attached EntityView, or an invalid EntityView if no view has been attached. Also available as the property attached_view.

GetName()

Returns the name of the sequence. Also available as the property name

SetOffset()

Set the sequence offset. By default, the offset is 0. Also available as the property offset.

GetOffset()

Returns the sequence offset. Also available as offset.

GetGaplessString()

Returns a string version of this sequence with all hyphens removed. Also available as the property gapless_string.

Normalise()

Remove ‘-’ and ‘.’ as gaps from the sequence and make it all upper case. Works in place.

SetName()

Set name of the sequence. Also available as the property name.

GetOneLetterCode(pos)
__getitem__(pos)
__getitem__(slice)
Returns:

Character at position pos of sequence (also supports pythonic slicing with [] operator)

Return type:

str

gapless_string

Shorthand for GetGaplessString()

name

Shorthand for GetName()/SetName()

attached_view

Shorthand for GetAttachedView().

offset

Shorthand for GetOffset()/SetOffset()

role

Role of this sequence.

Type:

str

__len__()
Returns:

The length of the sequence (including insertions and deletions)

__str__()
Returns:

The sequence as a string.

Copy()

Create a deep copy of the sequence. The newly created sequence has the same attached view (not a deep copy of the view!).

SequenceFromChain(name, chain)
Returns:

Sequence extracted from one letter codes in given chain with a view to the chain attached to it

Return type:

SequenceHandle

Parameters:
  • name (str) – Name of the sequence

  • chain (ChainHandle / ChainView) – Chain from which to extract sequence

Match(s1, s2)
Parameters:

Check whether the two sequences s1 and s2 match. This function performs are case-insensitive comparison of the two sequences. The character ‘X’ is interpreted as a wild card character that always matches the other sequence.

The SequenceList

CreateSequenceList()

Creates and returns a new SequenceList with no sequences.

class SequenceList
class ConstSequenceList

Represents a list of sequences. The class provides a row-based interface.

GetCount()
__len__()
Returns:

Number of sequences in the list.

Return type:

int

AddSequence(sequence)

Append a sequence to the list.

GetMinLength()
GetMaxLength()
Returns:

Minimal / maximal length of the sequences in this list.

Return type:

int

FindSequence(name)

Find sequence with given name. If the alignment contains several sequences with the same name, the first sequence is returned.

SequencesHaveEqualLength()
Returns:

True if all sequences have same length.

Take(n)
Returns:

First n (or last -n if n negative) sequences.

Slice(first, n)
Returns:

n sequences starting from first.

__getitem__(key)
Returns:

Sequence(s) indexed by key (supports pythonic slicing).

The AlignmentHandle

The AlignmentHandle represents a list of aligned sequences. In contrast to SequenceList, an alignment requires all sequences to be of the same length. New instances of alignments are created with CreateAlignment() and AlignmentFromSequenceList().

Typically sequence alignments are used column-based, i.e by looking at an aligned columns in the sequence alignment. To get a row-based (sequence) view on the sequence list, use GetSequences().

All functions that operate on an alignment will again produce a valid alignment. This mean that it is not possible to change the length of one sequence, without adjusting the other sequences, too.

The following example shows how to iterate over the columns and sequences of an alignment:

aln=io.LoadAlignment('aln.fasta')
# iterate over the columns
for col in aln:
  print(col)

# iterate over the sequences
for s in aln.sequences:
  print(s)
CreateAlignment()

Creates and returns a new AlignmentHandle with no sequences.

AlignmentFromSequenceList(sequences)

Create a new alignment from the given list of sequences

Parameters:

sequences (ConstSequenceList) – the list of sequences

Raises:

InvalidAlignment if the sequences do not have the same length.

class AlignmentHandle
GetSequence(index)
Returns:

Sequence at the given index, raising an IndexError when trying to access an inexistent sequence.

Return type:

ConstSequenceHandle

GetSequences()
Returns:

List of all sequence of the alignment. Also available as sequences.

Return type:

ConstSequenceList

GetLength()
__len__()
Returns:

Length of the alignment.

Return type:

int

GetCount()
Returns:

Number of sequences in the alignment. Also available as sequence_count.

Return type:

int

ToString(width=80)
Returns:

Formatted string version of the alignment. The sequences are split into smaller parts to fit into the number columns specified.

Return type:

str

aln=seq.CreateAlignment()
aln.AddSequence(seq.CreateSequence("A", "abcdefghik"))
aln.AddSequence(seq.CreateSequence("B", "1234567890"))
# The following command will print the output given below
print(aln.ToString(7))
# A abcde
# B 12345
#
# A fghik
# B 67890
FindSequence(name)
Returns:

Sequence with given name. If the alignment contains several sequences with the same name, the first sequence is returned.

SetSequenceName(seq_index, name)

Set the name of the sequence at index seq_index to name (see SequenceHandle.name).

Copy()

Create a deep copy of the alignment by copying each contained sequence (see SequenceHandle.Copy())

GetPos(seq_index, res_index)
Returns:

Position of residue with index equal to res_index in sequence at index seq_index (see SequenceHandle.GetPos())

GetResidueIndex(seq_index, pos)
Returns:

Residue index of residue at position pos in sequence at index seq_index (see SequenceHandle.GetResidueIndex())

GetResidue(seq_index, pos)
Returns:

Attached residue at position pos in sequence at index seq_index (see SequenceHandle.GetResidue()).

AttachView(seq_index, view)
AttachView(seq_index, view, chain_name)

Attach the given view to the sequence at index seq_index (see SequenceHandle.AttachView()).

Cut(start, end)

Removes the columns in the half-closed interval start, end from the alignment. Note that this function does not update offsets!

aln=seq.CreateAlignment()
aln.AddSequence(seq.CreateSequence("A", "abcd---hik"))
aln.AddSequence(seq.CreateSequence("B", "1234567890"))
aln.Cut(4, 7)

print(aln.ToString(80))
# will print
# A abcdhik
# B 1234890
Replace(new_region, start, end)

Replace the columns in the half-closed interval start, end with the columns in new_region.

Parameters:

new_region (AlignedRegion or AlignmentHandle) – The region to be inserted

GetMatchingBackboneViews(index1=0, index2=1)

Returns a tuple of entity views containing matching backbone atoms for the two sequences at index1 and index2, respectively. For each aligned column in the alignment, backbone atoms are added to the view if both aligned residues have them. It is guaranteed that the two views contain the same number of atoms and that the order of the atoms in the two views is the same.

The output of this function can be used to superpose two structures with SuperposeSVD().

Parameters:
  • index1 – The index of the first sequence

  • index2 – The index of the second sequence.

Raises:

In case one of the two sequences doesn’t have an attached view, a RuntimeError is raised.

AddSequence(sequence)

Append a sequence to the alignment. The sequence must have the same length as sequences already present in the alignment.

Raises:

RuntimeError if the sequence length does not match

Parameters:

sequence (ConstSequenceHandle) – Sequence to be added

GetSequenceOffset(index)
SetSequenceOffset(index, offset)

Get/set the offset for sequence at index (see SequenceHandle.offset).

Parameters:
  • index (int) – The index of the sequence

  • offset (int) – The new offset

Return type:

int

GetSequenceRole(index)
SetSequenceRole(index, role)

Get/Set the sequence role for sequence at index (see SequenceHandle.role).

Parameters:
  • index (int) – The index of the sequence

  • role (str) – The new role

Return type:

str

GetCoverage(index)

Get coverage of sequence at index to the first sequence.

Parameters:

index (int) – The index of the sequence

Returns:

Coverage as a number between 0 and 1.

RemoveSequence(index)

Remove sequence at index from the alignment.

sequences

Shorthand for GetSequences()

sequence_count

Shorthand for GetCount()

__getitem__(pos)
Returns:

Column at position pos of alignment.

Return type:

AlignedColumn

__getitem__(slice)
Returns:

Columns defined by by pythonic slicing.

Return type:

AlignedRegion

class AlignedRegion

Represents a slice of an AlignmentHandle.

GetAlignmentHandle()
Returns:

Alignment from which we slices.

Return type:

AlignmentHandle

GetLength()
__len__()
Returns:

Number of columns in the slice.

__getitem__(pos)
Returns:

Column at position pos within this slice.

Return type:

AlignedColumn

start

Starting position in alignment.

end

One after end position in alignment.

class AlignedColumn
GetIndex()
Returns:

Position in alignment.

GetRowCount()
Returns:

Number of rows in the column.

GetResidue(row)
Returns:

Attached residue for sequence at given row of this column (see AlignmentHandle.GetResidue()).

__getitem__(row)
Returns:

Character at given row of this column.

Return type:

str

__str__()
Returns:

String representation of column in alignment.

Extracting views from sequences

ViewsFromSequences(seq1, seq2)

Returns a tuple of entity views containing only the atoms of the aligned residues. The order of residues in the two views is guaranteed to be the same but the order of atoms within each residue may differ. If the order of atoms is crucial (e.g. for SuperposeSVD()) either prefilter the attached views to include only one atom per residue or use the slower (approx. 50% more runtime) AlignmentHandle.GetMatchingBackboneViews().

Returns:

Pair of views including all the aligned residues of the two given sequences. An alignment is

Return type:

tuple with two EntityView

Raises:

Exception if sequence lengths do not match or if any of the sequences is lacking an attached view.

ViewsFromAlignment(aln, index1=0, index2=1)
Returns:

Pair of views as in ViewsFromSequences().

Return type:

tuple with two EntityView

Parameters:
  • aln (AlignmentHandle) – Alignment from which to extract sequences.

  • index1 (int) – Index of first sequence in aln to use.

  • index2 (int) – Index of second sequence in aln to use.

Handling Sequence Profiles

The ProfileHandle provides a simple container for profiles for each residue. It mainly contains:

  • N ProfileColumn objects (N = number of residues in sequence) which each contains 20 amino acid frequencies

  • a sequence (str) of length N

  • a null_model to use for this profile

Optionally, HMM-related information can be added. This is transition probabilities between Match, Insertion or Deletion states or neff values (number of effective sequences, a measure of local sequence diversity).

class HMMTransition

The possible HMM-transitions between Match(M), Insertion(I) and Deletion(D) states. Transitions between Deletion and Insertion are disallowed:

HMM_M2M, HMM_M2I, HMM_M2D, HMM_I2M, HMM_I2I, HMM_D2M, HMM_D2D

class HMMData

Data container for HMM-related information that can be assigned to profile columns.

neff

Local diversity of all sequences that have a residue at this column of the full alignment.

neff_i

Local diversity of all sequences that have an insertion at this column of the full alignment.

neff_d

Same for deletion.

GetProb(transition)

Get HMM transition probability

Parameters:

transition (HMMTransition) – The transition

SetProb(transition, prob)

Set HMM transition probability

Parameters:
  • transition (HMMTransition) – The transition

  • prob (float) – The probablity to be set

class ProfileColumn
BLOSUMNullModel()

Static method, that returns a new ProfileColumn with amino acid frequencies given from the BLOSUM62 substitution matrix.

HHblitsNullModel()

Static method, that returns a new ProfileColumn with amino acid frequencies as set in HHblits output.

GetFreq(aa)
Returns:

Frequency of aa

Return type:

float

Parameters:

aa (str) – One letter code of standard amino acid

SetFreq(aa, freq)
Parameters:
  • aa (str) – One letter code of standard amino acid

  • freq (float) – The frequency of the given amino acid

GetScore(other, null_model)
Returns:

Column score as in Soeding-2005 paper.

Return type:

float

Parameters:
SetHMMData(data)
Parameters:

data (HMMData) – Data to be set

GetHMMData()

Returns previously set HMMData object.

Return type:

HMMData

Raises:

Error if data has never been set.

entropy

Shannon entropy based on the columns amino acid frequencies

Type:

float

hmm_data

Shortcut for GetHMMData()/SetHMMData()

class ProfileHandle
__len__()

Returns the length of the sequence for which we have profile.

Return type:

int

AddColumn(col, olc='X')

Appends column in the internal column list.

Parameters:
  • col (str) – Column to add to columns

  • olc – One letter code to add to sequence

Extract(from, to)
Parameters:
  • from (int) – Col Idx to start from

  • to (int) – End Idx, not included in sub-ProfileHandle

Returns:

sub-profile as defined by given indices (null_model and neff are copied, you might want to manually reset neff)

Return type:

ProfileHandle

Raises:

Error if to <= from or to > __len__().

GetAverageScore(other, offset=0)
Returns:

Average column score between other.columns[i] and this object’s columns[i+offset] for i in [0, len(other)-1] using this object’s null_model. See ProfileColumn.GetScore().

Return type:

float

Parameters:
  • other (ProfileHandle) – Other profile to compare with.

  • offset (int) – Start comparison at column offset of this object.

Raises:

Error if any columns[i+offset] out of bounds.

sequence

Sequence for which we have this profile. When setting a new value, the length and the number of profile columns must match (exception thrown otherwise).

Type:

str

columns

Iterable columns of the profile (read-only).

Type:

ProfileColumnList

null_model

Null model of the profile. By default this is set to ProfileColumn.HHblitsNullModel().

Type:

ProfileColumn

avg_entropy

Average entropy of all the columns (read-only).

Type:

float

neff

Measure for sequence diversity which is defined as the average of the per-column neff values. However, this is just a convenience attribute which can be set to arbitrary values but there is no guarantee that it’s the actual average of the per-column values.

class ProfileDB

A simple database to gather ProfileHandle objects. It is possible to save them to disk in a compressed format with limited accuracy (4 digits for each frequency).

Save(filename)
Parameters:

filename (str) – Name of file that will be generated on disk.

Load(filename)

Static loading method

Parameters:

filename (str) – Name of file from which the database should be loaded.

Returns:

The loaded database

AddProfile(name, prof)
Parameters:
  • name (str) – Name of profile to be added

  • prof (ProfileHandle) – Profile to be added

Raises:

Exception when filename is longer than 255 characters.

GetProfile(name)
Parameters:

name (str) – Name of profile to be returned

Returns:

The requested ProfileHandle

Raises:

Exception when no ProfileHandle for name exists.

Size()
Returns:

Number of ProfileHandle objects in the database

GetNames()
Returns:

A nonsorted list of the names of all ProfileHandle objects in the database