Linear Database

Many applications need to load large numbers of structures. Especially on distributed file systems, I/O becomes a bottleneck. OST provides a linear database to dump position data (e.g. CA positions) or character data (e.g. sequences) for fast retrieval. The actual data containers behave like linear data arrays, and an indexer keeps track of where to find the data for a given entry.

class LinearIndexer

The LinearIndexer keeps track of data locations, assuming a linear memory layout. Entries in the indexer are assemblies, each of which can contain an arbitrary number of chains of varying length. Whenever a new assembly is added, it is assigned a range that encloses all of its residues and directly follows the range of the previously added assembly. Both the range of the full assembly and the ranges of its single chains can then be accessed. Whenever an assembly with n residues is deleted, the ranges of all assemblies added after it are shifted down by n.
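The bookkeeping described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the OST implementation: the class ToyIndexer and its method names are hypothetical stand-ins, and ranges are half-open, [from, to[.

```python
class ToyIndexer:
    """Hypothetical stand-in for the range bookkeeping (not OST code)."""

    def __init__(self):
        # Insertion-ordered list of (assembly name, chain names, chain lengths).
        self._entries = []

    def add_assembly(self, name, chain_names, chain_lengths):
        if len(chain_names) != len(chain_lengths):
            raise ValueError("chain_names and chain_lengths differ in length")
        self._entries.append((name, list(chain_names), list(chain_lengths)))

    def remove_assembly(self, name):
        for i, entry in enumerate(self._entries):
            if entry[0] == name:
                del self._entries[i]
                return
        raise KeyError(name)

    def get_data_range(self, name, chain_name=None):
        # Ranges are derived from insertion order, so removing an assembly
        # automatically shifts every later assembly down.
        start = 0
        for entry_name, chain_names, chain_lengths in self._entries:
            if entry_name == name:
                if chain_name is None:
                    return (start, start + sum(chain_lengths))
                for cname, clen in zip(chain_names, chain_lengths):
                    if cname == chain_name:
                        return (start, start + clen)
                    start += clen
                raise KeyError(chain_name)
            start += sum(chain_lengths)
        raise KeyError(name)


indexer = ToyIndexer()
indexer.add_assembly("1abc", ["A", "B"], [3, 2])
indexer.add_assembly("2xyz", ["A"], [4])

range_before = indexer.get_data_range("2xyz")        # (5, 9)
chain_b_range = indexer.get_data_range("1abc", "B")  # (3, 5)

indexer.remove_assembly("1abc")                      # removes n = 5 residues
range_after = indexer.get_data_range("2xyz")         # (0, 4)
```

Deriving the ranges from insertion order is only one possible design; it keeps the example short but makes every lookup linear in the number of assemblies.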

Load(filename)

Loads indexer from file

Parameters:

filename (str) – Path to file to be loaded

Returns:

The loaded indexer

Return type:

LinearIndexer

Raises:

ost.Error if filename cannot be opened

Save(filename)

Saves indexer to file

Parameters:

filename (str) – Path to file where the indexer is stored

Raises:

ost.Error if filename cannot be created

AddAssembly(name, chain_names, chain_lengths)

Adds a new assembly to the indexer. The range assigned to that assembly directly follows the range of the previously added assembly.

Parameters:
  • name (str) – Name of the added assembly

  • chain_names (list of str) – Names of all chains of the added assembly

  • chain_lengths (list of int) – The corresponding lengths of the chains

Raises:

ost.Error if the lengths of chain_names and chain_lengths are inconsistent

RemoveAssembly(name)

Removes an assembly from the indexer. Assuming that the assembly contains a total of n residues, the ranges of all subsequent assemblies are shifted down by n.

Parameters:

name (str) – Name of the assembly to be removed

Raises:

ost.Error if name is not present

GetAssemblies()
Returns:

The names of all added assemblies

Return type:

list of str

Raises:

ost.Error if name is not present

GetChainNames(name)
Parameters:

name (str) – Name of assembly from which you want the chain names

Returns:

The chain names of the specified assembly

Return type:

list of str

Raises:

ost.Error if name is not present

GetChainLengths(name)
Parameters:

name (str) – Name of assembly from which you want the chain lengths

Returns:

The chain lengths of the specified assembly

Return type:

list of int

Raises:

ost.Error if name is not present

GetDataRange(name)

Get the range for a full assembly

Parameters:

name (str) – Name of the assembly from which you want the range

Returns:

Two values defining the range as [from, to[

Return type:

tuple of int

Raises:

ost.Error if name is not present

GetDataRange(name, chain_name)

Get the range for a chain of an assembly

Parameters:
  • name (str) – Name of the assembly from which you want the range

  • chain_name (str) – Name of the chain from which you want the range

Returns:

Two values defining the range as [from, to[

Return type:

tuple of int

Raises:

ost.Error if name is not present or the corresponding assembly has no chain with the specified chain name

GetNumResidues()
Returns:

The total number of residues in all added assemblies

Return type:

int

class LinearCharacterContainer

The LinearCharacterContainer stores characters in a linear memory layout that can represent sequences such as SEQRES or ATOMSEQ. It can be accessed using range parameters and the idea is to keep it in sync with a LinearIndexer.

Load(filename)

Loads container from file

Parameters:

filename (str) – Path to file to be loaded

Returns:

The loaded container

Return type:

LinearCharacterContainer

Raises:

ost.Error if filename cannot be opened

Save(filename)

Saves container to file

Parameters:

filename (str) – Path to file where the container is stored

Raises:

ost.Error if filename cannot be created

AddCharacters(characters)

Adds characters at the end of the internal data. Call this function with appropriate data whenever you add an assembly to the associated LinearIndexer

Parameters:

characters (str) – Characters to be added

ClearRange(range)

Removes all characters specified by range in the form [from, to[ from the internal data. Because the internal data layout is linear, all characters starting at to are shifted to the location defined by from. Call this function with the appropriate range whenever you remove an assembly from the associated LinearIndexer.

Parameters:

range (tuple of int) – Range to be deleted in form [from, to[

Raises:

ost.Error if range does not specify a valid range
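The shifting behaviour of ClearRange can be illustrated with a plain Python string. This is a sketch only: clear_range is a hypothetical helper, not an OST function. Since GetDataRange raises an error for names that are no longer present, the range presumably has to be queried from the indexer before the assembly is removed from it.

```python
def clear_range(data, rng):
    """Return data without the half-open range [from, to[ (hypothetical helper)."""
    start, stop = rng
    if not (0 <= start <= stop <= len(data)):
        raise ValueError("range does not specify a valid range")
    # Everything from `stop` onwards moves to the location defined by `start`.
    return data[:start] + data[stop:]


# Two assemblies stored back to back: 5 residues, then 4 residues.
data = "MKTAY" + "GSHW"
first_assembly_range = (0, 5)   # would come from the indexer in practice

data = clear_range(data, first_assembly_range)

# The second assembly has been shifted down by 5 and now starts at index 0.
second_assembly = data[0:4]
```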

GetCharacter(idx)
Returns:

The character at the specified location

Return type:

str

Raises:

ost.Error if idx does not specify a valid position

GetCharacters(range)
Returns:

The characters from the specified range

Return type:

str

Raises:

ost.Error if range does not specify a valid range

GetNumElements()
Returns:

The number of stored characters

Return type:

int

class LinearPositionContainer

The LinearPositionContainer stores positions in a linear memory layout. It can be accessed using range parameters and the idea is to keep it in sync with a LinearIndexer. To save memory, a lossy compression is applied that limits the accuracy to two digits. If the absolute value of an added position is very large (> ~10000), the accuracy is further lowered to one digit. This is all handled internally.
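The effect of a limited accuracy can be illustrated with a simple fixed-point round trip. The actual compression scheme used by OST is not described here, so the following is an assumption made purely for illustration; quantize is a hypothetical helper, and "digits" is read as decimal places.

```python
def quantize(value, decimals=2):
    """Round-trip a coordinate through a fixed-point integer (illustrative only)."""
    scale = 10 ** decimals
    stored = int(round(value * scale))  # integer that would be written
    return stored / scale               # value recovered on read


original = 12.3456
recovered = quantize(original)                # two decimals kept
error = abs(recovered - original)             # at most half a unit of 0.01

coarse = quantize(12345.678, decimals=1)      # large value, one decimal kept
```

With two decimals the round-trip error stays below 0.005 per coordinate, which is usually negligible compared to the uncertainty of experimental atom positions.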

Load(filename)

Loads container from file

Parameters:

filename (str) – Path to file to be loaded

Returns:

The loaded container

Return type:

LinearPositionContainer

Raises:

ost.Error if filename cannot be opened

Save(filename)

Saves container to file

Parameters:

filename (str) – Path to file where the container is stored

Raises:

ost.Error if filename cannot be created

AddPositions(positions)

Adds positions at the end of the internal data. Call this function with appropriate data whenever you add an assembly to the associated LinearIndexer

Parameters:

positions (ost.geom.Vec3List) – Positions to be added

ClearRange(range)

Removes all positions specified by range in the form [from, to[ from the internal data. Because the internal data layout is linear, all positions starting at to are shifted to the location defined by from. Call this function with the appropriate range whenever you remove an assembly from the associated LinearIndexer.

Parameters:

range (tuple of int) – Range to be deleted in form [from, to[

Raises:

ost.Error if range does not specify a valid range

GetPosition(idx, pos)

Extracts a position at specified location. For efficiency reasons, the function requires the position to be passed as reference.

Parameters:
  • idx (int) – Specifies location

  • pos (ost.geom.Vec3) – Will be altered to the desired position

Raises:

ost.Error if idx does not specify a valid position

GetPositions(range, positions)

Extracts positions at specified range. For efficiency reasons, the function requires the positions to be passed as reference.

Parameters:
  • range (tuple of int) – Range in form [from, to[ that defines positions to be extracted

  • positions (ost.geom.Vec3List) – Will be altered to the desired positions

Raises:

ost.Error if range does not specify a valid range

GetNumElements()
Returns:

The number of stored positions

Return type:

int

Data Extraction

OpenStructure provides data extraction functionality for the following scenario: There are three binary containers: a position container holding CA positions (LinearPositionContainer), a SEQRES container and an ATOMSEQ container (both LinearCharacterContainer). They contain entries from the protein structure database, and sequence/position data is relative to the SEQRES of those entries. This means that if the SEQRES has more characters than there are resolved residues in the structure, the entry in the position container still contains exactly as many elements as there are SEQRES characters, but some positions remain invalid. That is where the ATOMSEQ container comes in. It only contains residues matching the SEQRES but marks non-resolved residues with ‘-’.

ExtractValidPositions(entry_name, chain_name, indexer, atomseq_container, position_container, seq, positions)

Iterates over all data for a chain specified by entry_name and chain_name. For every data point marked as valid in the atomseq_container (character at that position is not ‘-’), the character and the corresponding position are added to seq and positions.

Parameters:
  • entry_name (str) – Name of assembly you want the data from

  • chain_name (str) – Name of chain you want the data from

  • indexer (LinearIndexer) – Used to access atomseq_container and position_container

  • atomseq_container (LinearCharacterContainer) – Container that marks locations with invalid position data with ‘-’

  • position_container (LinearPositionContainer) – Container containing position data

  • seq (ost.seq.SequenceHandle) – Sequence with extracted valid positions gets stored in here.

  • positions (ost.geom.Vec3List) – The extracted valid positions get stored in here

Raises:

ost.Error if requested data is not present
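The filtering step can be sketched in plain Python. This is an illustration only: extract_valid is a hypothetical helper that operates on a string and a list of coordinate tuples instead of the OST containers, and it returns its results rather than filling output arguments.

```python
def extract_valid(atomseq, positions):
    """Keep only entries whose ATOMSEQ character is not '-' (hypothetical helper)."""
    if len(atomseq) != len(positions):
        raise ValueError("atomseq and positions must have equal length")
    seq_chars = []
    valid_positions = []
    for char, pos in zip(atomseq, positions):
        if char != "-":
            seq_chars.append(char)
            valid_positions.append(pos)
    return "".join(seq_chars), valid_positions


# Chain with SEQRES "MKTAY" in which the first and fourth residues are unresolved.
atomseq = "-KT-Y"
positions = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0), (4.0, 5.0, 6.0),
             (0.0, 0.0, 0.0), (7.0, 8.0, 9.0)]

seq, valid = extract_valid(atomseq, positions)
```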

ExtractTemplateData(entry_name, chain_name, aln, indexer, seqres_container, atomseq_container, position_container)

Assume a target-template alignment in aln (first sequence: target, second sequence: template). This function extracts all valid template positions for the entry specified by entry_name and chain_name. The template sequence in aln must match the sequence in seqres_container. Again, the atomseq_container is used to identify valid positions. The corresponding residue numbers relative to the target sequence in aln are also returned.

Parameters:
  • entry_name (str) – Name of assembly you want the data from

  • chain_name (str) – Name of chain you want the data from

  • aln – Target-template sequence alignment

  • indexer (LinearIndexer) – Used to access atomseq_container, seqres_container and position_container

  • seqres_container (LinearCharacterContainer) – Container containing the full sequence data

  • atomseq_container (LinearCharacterContainer) – Container that marks locations with invalid position data with ‘-’

  • position_container (LinearPositionContainer) – Container containing position data

Returns:

First element: list of residue numbers that relate each entry in the second element to the target sequence specified in aln. The numbering scheme starts from one. Second element: geom.Vec3List with the corresponding positions.

Return type:

tuple

Raises:

ost.Error if requested data is not present in the container or if the template sequence in aln does not match the sequence in seqres_container
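The mapping from alignment columns to target residue numbers can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the OST implementation: extract_template_data is a hypothetical helper that takes the two aligned sequences as gapped strings, and it assumes the ungapped template sequence covers the full SEQRES.

```python
def extract_template_data(target_aln, template_aln, seqres, atomseq, positions):
    """Return (target residue numbers, template positions) for valid columns."""
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    if template_aln.replace("-", "") != seqres:
        raise ValueError("template sequence does not match SEQRES")
    residue_numbers = []
    template_positions = []
    target_num = 0      # one-based residue number in the target
    template_idx = -1   # zero-based index into seqres/atomseq/positions
    for target_char, template_char in zip(target_aln, template_aln):
        if target_char != "-":
            target_num += 1
        if template_char != "-":
            template_idx += 1
        # Keep a column only if both residues exist and the template
        # residue is resolved according to the ATOMSEQ.
        if (target_char != "-" and template_char != "-"
                and atomseq[template_idx] != "-"):
            residue_numbers.append(target_num)
            template_positions.append(positions[template_idx])
    return residue_numbers, template_positions


target_aln = "MK-TAYG"
template_aln = "MKS-AY-"
seqres = "MKSAY"
atomseq = "MKS-Y"       # fourth template residue is unresolved
positions = [(float(i), 0.0, 0.0) for i in range(5)]

numbers, coords = extract_template_data(
    target_aln, template_aln, seqres, atomseq, positions)
```

In this example only three columns survive: the third column has a gap in the target, the fourth and last have gaps in the template, and the aligned A residue is unresolved in the template.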