Linear Database
Many applications need to load large numbers of structures. Especially on distributed file systems, I/O becomes a bottleneck. OST provides a linear database to dump position data (e.g. CA positions) or character data (e.g. sequences) for fast retrieval. The actual data containers behave like linear data arrays, and an indexer keeps track of where to find the data for a certain entry.
-
class LinearIndexer
The LinearIndexer keeps track of the locations of data, assuming a linear memory layout. The entries of the indexer are assemblies, each of which can contain an arbitrary number of chains of varying length. Whenever a new assembly is added, a range enclosing all residues of that assembly is defined directly after the range of the previously added assembly. It is then possible to access not only the range of the full assembly but also the range of each single chain. Whenever an assembly with n residues is deleted, the ranges of all assemblies that were added later are shifted down by n.
-
Load(filename)
Loads indexer from file.
Parameters: filename (str) – Path to file to be loaded
Returns: The loaded indexer
Return type: LinearIndexer
Raises: ost.Error if filename cannot be opened
-
Save(filename)
Saves indexer to file.
Parameters: filename (str) – Path to file where the indexer is stored
Raises: ost.Error if filename cannot be created
-
AddAssembly(name, chain_names, chain_lengths)
Adds a new assembly to the indexer. The range assigned to that assembly directly follows the range of the previously added assembly.
Parameters:
- name (str) – Name of the added assembly
- chain_names (list of str) – Names of all chains of the added assembly
- chain_lengths (list of int) – The corresponding lengths of the chains
Raises: ost.Error if the lengths of chain_names and chain_lengths are inconsistent
-
RemoveAssembly(name)
Removes an assembly from the indexer. If that assembly contains a total of n residues, the ranges of all subsequent assemblies are shifted down by n.
Parameters: name – Name of the assembly to be removed
Raises: ost.Error if name is not present
-
GetAssemblies()
Returns: The names of all added assemblies
Return type: list of str
Raises: ost.Error if name is not present
-
GetChainNames(name)
Parameters: name (str) – Name of the assembly from which you want the chain names
Returns: The chain names of the specified assembly
Return type: list of str
Raises: ost.Error if name is not present
-
GetChainLengths(name)
Parameters: name (str) – Name of the assembly from which you want the chain lengths
Returns: The chain lengths of the specified assembly
Return type: list of int
Raises: ost.Error if name is not present
-
GetDataRange(name)
Gets the range for a full assembly.
Parameters: name (str) – Name of the assembly from which you want the range
Returns: Two values defining the range as [from, to[
Return type: tuple of int
Raises: ost.Error if name is not present
-
GetDataRange(name, chain_name)
Gets the range for a chain of an assembly.
Parameters:
- name (str) – Name of the assembly from which you want the range
- chain_name (str) – Name of the chain from which you want the range
Returns: Two values defining the range as [from, to[
Return type: tuple of int
Raises: ost.Error if name is not present or the corresponding assembly has no chain with the specified chain name
-
GetNumResidues()
Returns: The total number of residues in all added assemblies
Return type: int
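The range bookkeeping described for LinearIndexer can be sketched in plain Python. The following is a minimal toy model, not the OST implementation: the class ToyIndexer and its method names are hypothetical and only mimic the documented behaviour (consecutive ranges, per-chain ranges, and ranges shifting down when an earlier assembly is removed).

```python
class ToyIndexer:
    """Hypothetical pure-Python model of linear range bookkeeping (not the OST class)."""

    def __init__(self):
        # Insertion-ordered list of (assembly_name, chain_names, chain_lengths).
        self._entries = []

    def add_assembly(self, name, chain_names, chain_lengths):
        if len(chain_names) != len(chain_lengths):
            raise ValueError("chain_names and chain_lengths differ in length")
        self._entries.append((name, list(chain_names), list(chain_lengths)))

    def remove_assembly(self, name):
        for i, entry in enumerate(self._entries):
            if entry[0] == name:
                # Ranges are computed on demand, so later assemblies
                # automatically shift down by the removed residue count.
                del self._entries[i]
                return
        raise KeyError(name)

    def get_data_range(self, name, chain_name=None):
        """Return the half-open range (from, to) of an assembly or one of its chains."""
        start = 0
        for entry_name, chain_names, chain_lengths in self._entries:
            if entry_name == name:
                if chain_name is None:
                    return (start, start + sum(chain_lengths))
                for c_name, c_len in zip(chain_names, chain_lengths):
                    if c_name == chain_name:
                        return (start, start + c_len)
                    start += c_len
                raise KeyError(chain_name)
            start += sum(chain_lengths)
        raise KeyError(name)


indexer = ToyIndexer()
indexer.add_assembly("A1", ["A", "B"], [100, 50])
indexer.add_assembly("A2", ["A"], [30])
range_before = indexer.get_data_range("A2")   # range of A2 while A1 is present
chain_b = indexer.get_data_range("A1", "B")   # range of chain B inside A1
indexer.remove_assembly("A1")
range_after = indexer.get_data_range("A2")    # A2 shifted down by 150 residues
```

Computing the ranges on demand keeps the sketch short; a real indexer would more likely store the offsets and update them on removal.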
-
-
class LinearCharacterContainer
The LinearCharacterContainer stores characters in a linear memory layout that can represent sequences such as SEQRES or ATOMSEQ. It can be accessed using range parameters, and the idea is to keep it in sync with a LinearIndexer.
-
Load(filename)
Loads container from file.
Parameters: filename (str) – Path to file to be loaded
Returns: The loaded container
Return type: LinearCharacterContainer
Raises: ost.Error if filename cannot be opened
-
Save(filename)
Saves container to file.
Parameters: filename (str) – Path to file where the container is stored
Raises: ost.Error if filename cannot be created
-
AddCharacters(characters)
Adds characters at the end of the internal data. Call this function with the appropriate data whenever you add an assembly to the associated LinearIndexer.
Parameters: characters (str) – Characters to be added
-
ClearRange(range)
Removes all characters in the range [from, to[ from the internal data. Because the internal data layout is linear, all characters starting at to are shifted to the location defined by from. Call this function with the appropriate range whenever you remove an assembly from the associated LinearIndexer.
Parameters: range (tuple of int) – Range to be deleted in the form [from, to[
Raises: ost.Error if range does not specify a valid range
-
GetCharacter(idx)
Returns: The character at the specified location
Return type: str
Raises: ost.Error if idx does not specify a valid position
-
GetCharacters(range)
Returns: The characters from the specified range
Return type: str
Raises: ost.Error if range does not specify a valid range
-
GetNumElements()
Returns: The number of stored characters
Return type: int
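The effect of ClearRange on a linear character layout can be illustrated with an ordinary Python string. This is only a sketch under assumptions: the helper clear_range and the assembly names are hypothetical, and whether OST accepts an empty range is not stated in this documentation (the sketch allows it).

```python
def clear_range(data, rng):
    """Remove the half-open range (from, to) from a string; later characters shift down."""
    start, end = rng
    if not (0 <= start <= end <= len(data)):
        raise ValueError("range does not specify a valid range")
    return data[:start] + data[end:]


seqres = ""
seqres += "MKV"     # hypothetical assembly "X", 3 residues, range [0, 3[
seqres += "GASTL"   # hypothetical assembly "Y", 5 residues, range [3, 8[

y_before = seqres[3:8]                 # data of "Y" at its original range
seqres = clear_range(seqres, (0, 3))   # remove the data of "X"
y_after = seqres[0:5]                  # "Y" now starts at 0, as the indexer would report
```

The same range that the indexer reported for the removed assembly is the one passed to the container, which is what keeps the two in sync.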
-
-
class LinearPositionContainer
The LinearPositionContainer stores positions in a linear memory layout. It can be accessed using range parameters, and the idea is to keep it in sync with a LinearIndexer. To save memory, a lossy compression is applied that results in a limited accuracy of two digits. If the absolute value of an added position is very large (> ~10000), the accuracy is further lowered to one digit. This is all handled internally.
-
Load(filename)
Loads container from file.
Parameters: filename (str) – Path to file to be loaded
Returns: The loaded container
Return type: LinearPositionContainer
Raises: ost.Error if filename cannot be opened
-
Save(filename)
Saves container to file.
Parameters: filename (str) – Path to file where the container is stored
Raises: ost.Error if filename cannot be created
-
AddPositions(positions)
Adds positions at the end of the internal data. Call this function with the appropriate data whenever you add an assembly to the associated LinearIndexer.
Parameters: positions (ost.geom.Vec3List) – Positions to be added
-
ClearRange(range)
Removes all positions in the range [from, to[ from the internal data. Because the internal data layout is linear, all positions starting at to are shifted to the location defined by from. Call this function with the appropriate range whenever you remove an assembly from the associated LinearIndexer.
Parameters: range (tuple of int) – Range to be deleted in the form [from, to[
Raises: ost.Error if range does not specify a valid range
-
GetPosition(idx, pos)
Extracts the position at the specified location. For efficiency reasons, the function requires the position to be passed by reference.
Parameters:
- idx (int) – Specifies the location
- pos (ost.geom.Vec3) – Will be altered to the desired position
Raises: ost.Error if idx does not specify a valid position
-
GetPositions(range, positions)
Extracts the positions in the specified range. For efficiency reasons, the function requires the positions to be passed by reference.
Parameters:
- range (tuple of int) – Range in the form [from, to[ that defines the positions to be extracted
- positions (ost.geom.Vec3List) – Will be altered to the desired positions
Raises: ost.Error if range does not specify a valid range
-
GetNumElements()
Returns: The number of stored positions
Return type: int
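The consequence of the limited accuracy mentioned for LinearPositionContainer can be illustrated with plain rounding. This sketch only mimics the stated precision and says nothing about how OST actually compresses the data. It assumes that "digits" means decimal places and that the switch happens at exactly 10000; both are assumptions, as are the function and parameter names.

```python
def approximate_stored_value(x, large_threshold=10000.0):
    """Round a coordinate to the precision a value of this magnitude would retain.

    Assumption: two decimal places normally, one decimal place once the
    absolute value exceeds large_threshold.
    """
    digits = 1 if abs(x) > large_threshold else 2
    return round(x, digits)


small = approximate_stored_value(12.3456)      # typical coordinate, two decimals kept
large = approximate_stored_value(12345.678)    # very large coordinate, one decimal kept
small_error = abs(small - 12.3456)
large_error = abs(large - 12345.678)
```

Under these assumptions the round-trip error stays at or below about 0.005 for typical coordinates and about 0.05 for very large ones, which is worth keeping in mind when comparing retrieved positions with the originals.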
-
Data Extraction
OpenStructure provides data extraction functionality for the following scenario:
there are three binary containers, namely a position container holding CA positions
(LinearPositionContainer), a SEQRES container and an ATOMSEQ container
(both LinearCharacterContainer).
They contain entries from the protein structure database,
and the sequence and position data are relative to the SEQRES of those entries.
This means that if the SEQRES has more characters than there are resolved residues
in the structure, the entry in the position container still contains exactly as
many elements as the SEQRES has characters, but some positions remain invalid.
That's where the ATOMSEQ container comes in. It only contains residues matching
the SEQRES but marks non-resolved residues with '-'.
-
ExtractValidPositions(entry_name, chain_name, indexer, atomseq_container, position_container, seq, positions)
Iterates over all data for the chain specified by entry_name and chain_name. For every data point marked as valid in the atomseq_container (the character at that position is not '-'), the character and the corresponding position are added to seq and positions.
Parameters:
- entry_name (str) – Name of the assembly you want the data from
- chain_name (str) – Name of the chain you want the data from
- indexer (LinearIndexer) – Used to access atomseq_container and position_container
- atomseq_container (LinearCharacterContainer) – Container that marks locations with invalid position data with '-'
- position_container (LinearPositionContainer) – Container containing the position data
- seq (ost.seq.SequenceHandle) – The sequence of the extracted valid positions gets stored in here
- positions (ost.geom.Vec3List) – The extracted valid positions get stored in here
Raises: ost.Error if the requested data is not present
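The filtering performed by ExtractValidPositions can be sketched with plain Python data. This is an illustration of the described logic, not the OST code: strings and tuples stand in for the containers, and the function name and example data are hypothetical.

```python
def extract_valid(atomseq, positions):
    """Keep only the characters and positions whose ATOMSEQ character is not '-'."""
    if len(atomseq) != len(positions):
        raise ValueError("atomseq and positions must have one element per SEQRES character")
    valid_seq = ""
    valid_positions = []
    for olc, pos in zip(atomseq, positions):
        if olc != "-":
            valid_seq += olc
            valid_positions.append(pos)
    return valid_seq, valid_positions


# Hypothetical chain with SEQRES "MKVGA"; the first and fourth residues are unresolved.
atomseq = "-KV-A"
# One slot per SEQRES character; slots of unresolved residues hold placeholder data.
positions = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0), (4.0, 5.0, 6.0),
             (0.0, 0.0, 0.0), (7.0, 8.0, 9.0)]

valid_seq, valid_positions = extract_valid(atomseq, positions)
```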
-
ExtractTemplateData(entry_name, chain_name, aln, indexer, seqres_container, atomseq_container, position_container)
Let's say we have a target-template alignment in aln (first sequence: target, second sequence: template). This function extracts all valid template positions for the entry specified by entry_name and chain_name. The template sequence in aln must match the sequence in seqres_container. Again, the atomseq_container is used to identify valid positions. The corresponding residue numbers relative to the target sequence in aln are also returned.
Parameters:
- entry_name (str) – Name of the assembly you want the data from
- chain_name (str) – Name of the chain you want the data from
- aln – Target-template sequence alignment
- indexer (LinearIndexer) – Used to access atomseq_container, seqres_container and position_container
- seqres_container (LinearCharacterContainer) – Container containing the full sequence data
- atomseq_container (LinearCharacterContainer) – Container that marks locations with invalid position data with '-'
- position_container (LinearPositionContainer) – Container containing the position data
Returns: First element: list of residue numbers that relate each entry in the second element to the target sequence specified in aln. The numbering scheme starts from one. Second element: geom.Vec3List with the corresponding positions.
Return type: tuple
Raises: ost.Error if the requested data is not present in the container or if the template sequence in aln doesn't match the sequence in seqres_container
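The mapping from alignment columns to target residue numbers can be sketched as follows. This is one plausible reading of the description above, not the OST implementation. It assumes that the gapless template row must equal the full SEQRES of the chain and that only columns where target and template are both aligned contribute. The function name, argument layout and example data are hypothetical.

```python
def extract_template_data(target_row, template_row, seqres, atomseq, positions):
    """Return (target residue numbers, template positions) for all valid aligned columns."""
    if len(target_row) != len(template_row):
        raise ValueError("alignment rows differ in length")
    if template_row.replace("-", "") != seqres:
        raise ValueError("template sequence does not match SEQRES")
    residue_numbers = []
    valid_positions = []
    target_num = 0     # 1-based number of the most recent target residue
    template_idx = 0   # 0-based index into seqres / atomseq / positions
    for target_char, template_char in zip(target_row, template_row):
        if target_char != "-":
            target_num += 1
        if template_char != "-":
            # Keep the column only if the target is aligned here and
            # the template residue is resolved in the structure.
            if target_char != "-" and atomseq[template_idx] != "-":
                residue_numbers.append(target_num)
                valid_positions.append(positions[template_idx])
            template_idx += 1
    return residue_numbers, valid_positions


target_row = "MK-VGA"
template_row = "MKLV-A"
seqres = "MKLVA"
atomseq = "M-LVA"   # second template residue is unresolved
positions = [(float(i), 0.0, 0.0) for i in range(5)]

numbers, coords = extract_template_data(target_row, template_row,
                                        seqres, atomseq, positions)
```

In this example the second template residue is dropped because it is unresolved and the third because it faces a gap in the target, so the returned numbering skips the corresponding target residues.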