mmCIF File Format¶
The mmCIF file format is a container for structural entities provided by the
PDB. Saving and loading happen through dedicated convenience functions
(ost.io.LoadMMCIF() / ost.io.SaveMMCIF()). Here we provide more in-depth
information on mmCIF IO and describe how to deal with information that goes
beyond the legacy PDB format (MMCifInfo, MMCifInfoCitation, MMCifInfoTransOp,
MMCifInfoBioUnit, MMCifInfoStructDetails, MMCifInfoObsolete, MMCifInfoStructRef,
MMCifInfoStructRefSeq, MMCifInfoStructRefSeqDif, MMCifInfoRevisions,
MMCifInfoEntityBranchLink).
Reading mmCIF files¶
Categories Available¶
The following categories of a mmCIF file are considered by the reader:
atom_site: Used to build the EntityHandle
entity: Involved in setting ChainType of chains
entity_poly: Involved in setting ChainType of chains
citation: Goes into MMCifInfoCitation
citation_author: Goes into MMCifInfoCitation
refine: Goes into MMCifInfo as resolution, r_free and r_work.
em_3d_reconstruction: Goes into MMCifInfo as em_resolution.
pdbx_struct_assembly: Used for MMCifInfoBioUnit.
pdbx_struct_assembly_gen: Used for MMCifInfoBioUnit.
pdbx_struct_oper_list: Used for MMCifInfoBioUnit.
struct: Details about a structure, stored in MMCifInfoStructDetails.
struct_conf: Stores secondary structure information (practically helices) in the EntityHandle
struct_sheet_range: Stores secondary structure information for sheets in the EntityHandle
pdbx_database_PDB_obs_spr: Verbose information on obsoleted/superseded entries, stored in MMCifInfoObsolete
struct_ref: stored in MMCifInfoStructRef
struct_ref_seq: stored in MMCifInfoStructRefSeq
struct_ref_seq_dif: stored in MMCifInfoStructRefSeqDif
database_pdb_rev (mmCIF dictionary version < 5): stored in MMCifInfoRevisions
pdbx_audit_revision_history and pdbx_audit_revision_details (mmCIF dictionary version >= 5): used to fill MMCifInfoRevisions
pdbx_entity_branch and pdbx_entity_branch_link: used for MMCifInfoEntityBranchLink; a list of links is available via GetEntityBranchLinks() and GetEntityBranchByChain()
Notes:
Structures in mmCIF format can have two chain names. The “new” chain name extracted from atom_site.label_asym_id is used to name the chains in the EntityHandle. The “old” (author provided) chain name is extracted from _atom_site.auth_asym_id for the first atom of the chain. It is added as a string property named “pdb_auth_chain_name” to the ChainHandle. The mapping is also stored in MMCifInfo as GetMMCifPDBChainTr() and GetPDBMMCifChainTr() (the latter only for polymer chains).
Molecular entities in mmCIF are identified by an entity.id, which is extracted from atom_site.label_entity_id for the first atom of the chain. It is added as a string property named “entity_id” to the ChainHandle. Each chain is mapped to an ID in MMCifInfo as GetMMCifEntityIdTr().
For more complex mappings, such as ligands which may be in the same “old” chain as the protein chain but are represented in a separate “new” chain in mmCIF, we also store string properties on a per-residue level. For mmCIF files from the PDB, there is a unique mapping between (label_asym_id, label_seq_id) and (auth_asym_id, auth_seq_id, pdbx_PDB_ins_code). The following data items are available:
atom_site.label_asym_id: residue.chain.name
_atom_site.label_seq_id: residue.GetStringProp("resnum") (this is the same as residue.number for residues in polymer chains. For ligands, however, the value is unset in mmCIF, while residue.number is set to 1 by OpenStructure.)
atom_site.label_entity_id: residue.GetStringProp("entity_id")
_atom_site.auth_asym_id: residue.GetStringProp("pdb_auth_chain_name")
atom_site.auth_seq_id: residue.GetStringProp("pdb_auth_resnum")
atom_site.pdbx_PDB_ins_code: residue.GetStringProp("pdb_auth_ins_code")
The last two items might be missing (not empty) if atom_site.auth_seq_id or atom_site.pdbx_PDB_ins_code are not present in the mmCIF file.
Missing values in the aforementioned data items will be denoted as . or ?.
Author residue numbers (atom_site.auth_seq_id) and insertion codes (atom_site.pdbx_PDB_ins_code) are optional according to the mmCIF dictionary. The data items (whole columns) can be omitted in structures where the “new” residue numbers (_atom_site.label_seq_id) are defined (to valid values). This is usually the case for polymer chains. However, non-polymer and water chains do not have valid “new” residue numbers. In structures containing such missing data, OST requires the presence of both “old” residue numbers and insertion codes in order to identify and build residues properly. It is a known limitation of the mmCIF format that it allows ambiguous identifiers for waters (and ligands to some extent), so we have to require these additional identifiers.
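The per-residue properties listed above can be collected into an author-style identifier. The following is a minimal sketch and not part of OpenStructure: author_identifier is a hypothetical helper, and FakeResidue is a stand-in that only mimics the generic property calls so the snippet runs without OpenStructure. It assumes that residues offer HasProp() next to GetStringProp(), which is what allows a missing property to be told apart from a present one.

```python
class FakeResidue:
    """Stand-in for a residue handle (assumption: real residues offer
    HasProp() and GetStringProp() with this behaviour)."""

    def __init__(self, props):
        self._props = dict(props)

    def HasProp(self, key):
        return key in self._props

    def GetStringProp(self, key):
        return self._props[key]


AUTH_KEYS = ("pdb_auth_chain_name", "pdb_auth_resnum", "pdb_auth_ins_code")


def author_identifier(residue):
    """Return (auth chain, auth resnum, insertion code).

    Properties that were never set are reported as None; the mmCIF
    placeholders "." and "?" are passed through unchanged.
    """
    return tuple(
        residue.GetStringProp(key) if residue.HasProp(key) else None
        for key in AUTH_KEYS
    )


# Made-up example values for a ligand residue.
ligand = FakeResidue({"pdb_auth_chain_name": "A",
                      "pdb_auth_resnum": "301",
                      "pdb_auth_ins_code": "?"})
print(author_identifier(ligand))
```

Returning None for unset properties keeps "column absent from the file" distinct from "value given as ? or .".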
Info Classes¶
Information from mmCIF files that goes beyond structural data is kept in a
special container, the MMCifInfo class. Here is a detailed description
of the annotation available.
- class MMCifInfo¶
This is the container for all bits of non-molecular data pulled from a mmCIF file.
- citations¶
Stores a list of citations (MMCifInfoCitation).
Also available as GetCitations().
- biounits¶
Stores a list of biounits (MMCifInfoBioUnit).
Also available as GetBioUnits().
- method¶
Stores the experimental method used to create the structure (_exptl.method).
Also available as GetMethod(). May also be modified by SetMethod().
Some PDB entries have multiple experimental methods. This function returns only a single one of them.
- resolution¶
Stores the resolution of the crystal structure, obtained from the refine.ls_d_res_high data item. Set to 0 if no value in loaded mmCIF file.
Also available as GetResolution(). May also be modified by SetResolution().
- em_resolution¶
Stores the resolution of the EM reconstruction, obtained from the em_3d_reconstruction.resolution data item. Set to 0 if no value in loaded mmCIF file.
Also available as GetEMResolution(). May also be modified by SetEMResolution().
- r_free¶
Stores the R-free value of the crystal structure. Set to 0 if no value in loaded mmCIF file.
Also available as GetRFree(). May also be modified by SetRFree().
- r_work¶
Stores the R-work value of the crystal structure. Set to 0 if no value in loaded mmCIF file.
Also available as GetRWork(). May also be modified by SetRWork().
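Since the numeric quality attributes use 0 to signal "no value in the file", a small wrapper can make that explicit. This is a sketch under assumptions: quality_summary is a hypothetical helper that relies only on the getters documented above, and FakeInfo is a stand-in with made-up values so the snippet runs without OpenStructure.

```python
def quality_summary(info):
    """Collect method and quality indicators from an MMCifInfo-like object.

    A value of 0 means "not present in the file" and is reported as None.
    """
    def or_none(value):
        return value if value != 0 else None

    return {
        "method": info.GetMethod(),
        "resolution": or_none(info.GetResolution()),
        "em_resolution": or_none(info.GetEMResolution()),
        "r_free": or_none(info.GetRFree()),
        "r_work": or_none(info.GetRWork()),
    }


class FakeInfo:
    """Stand-in exposing only the getters used above (not an OST class)."""

    def GetMethod(self):
        return "X-RAY DIFFRACTION"

    def GetResolution(self):
        return 1.8

    def GetEMResolution(self):
        return 0.0

    def GetRFree(self):
        return 0.21

    def GetRWork(self):
        return 0.18


summary = quality_summary(FakeInfo())
print(summary)
```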
- operations¶
Stores the operations needed to transform a crystal structure into a bio unit.
Also available as GetOperations(). May also be modified by AddOperation().
- struct_details¶
Stores details about the structure in a MMCifInfoStructDetails object.
Also available as GetStructDetails(). May also be modified by SetStructDetails().
- struct_refs¶
Lists all links to external databases in the mmCIF file as a list of MMCifInfoStructRef.
Also available as GetStructRefs(). May also be modified by SetStructRefs().
- revisions¶
Stores a simple history of a PDB entry.
Also available as GetRevisions(). May be extended by AddRevision().
- Type:
MMCifInfoRevisions
- obsolete¶
Stores information about obsoleted / superseded entries.
Also available as GetObsoleteInfo(). May also be modified by SetObsoleteInfo().
- Type:
MMCifInfoObsolete
- AddCitation(citation)¶
Add a citation to the citation list of an info object.
- Parameters:
citation (MMCifInfoCitation) – Citation to be added.
- AddAuthorsToCitation(id, authors, fault_tolerant=False)¶
Adds a list of authors to a specific citation.
- Parameters:
id (str) – Identifier of the citation.
authors (StringList) – List of authors.
fault_tolerant (bool) – Logs a warning if id is not found and proceeds without setting anything if set to True. Raises otherwise.
- AddBioUnit(biounit)¶
Add a bio unit to the bio unit list of an info object. If the id of biounit already exists in the set of assemblies, both will be merged. This means that chain and operations lists will be concatenated and the interval lists (operationsintervalls, chainintervalls) will be updated.
- Parameters:
biounit (MMCifInfoBioUnit) – Bio unit to be added.
- SetResolution(resolution)¶
See
resolution
- GetResolution()¶
See
resolution
- AddOperation(operation)¶
See
operations
- GetOperations()¶
See
operations
- SetStructDetails(details)¶
See
struct_details
- GetStructDetails()¶
See
struct_details
- SetStructRef(refs)¶
See
struct_refs
- GetStructRef()¶
See
struct_refs
- AddMMCifPDBChainTr(cif_chain_id, pdb_chain_id)¶
Set up a translation for a certain mmCIF chain name to the traditional PDB chain name.
- Parameters:
cif_chain_id (str) – atom_site.label_asym_id
pdb_chain_id (str) – _atom_site.auth_asym_id
- GetMMCifPDBChainTr(cif_chain_id)¶
Get the translation of a certain mmCIF chain name to the traditional PDB chain name.
- Parameters:
cif_chain_id (str) – atom_site.label_asym_id
- Returns:
_atom_site.auth_asym_id as str (empty if no mapping)
- AddPDBMMCifChainTr(pdb_chain_id, cif_chain_id)¶
Set up a translation for a certain PDB chain name to the mmCIF chain name.
- Parameters:
pdb_chain_id (str) – _atom_site.auth_asym_id
cif_chain_id (str) – atom_site.label_asym_id
- GetPDBMMCifChainTr(pdb_chain_id)¶
Get the translation of a certain PDB chain name to the mmCIF chain name.
- Parameters:
pdb_chain_id (str) – _atom_site.auth_asym_id
- Returns:
atom_site.label_asym_id as str (empty if no mapping)
- AddMMCifEntityIdTr(cif_chain_id, entity_id)¶
Set up a translation for a certain mmCIF chain name to the mmCIF entity ID.
- Parameters:
cif_chain_id (str) – atom_site.label_asym_id
entity_id (str) – atom_site.label_entity_id
- GetMMCifEntityIdTr(cif_chain_id)¶
Get the translation of a certain mmCIF chain name to the mmCIF entity ID.
- Parameters:
cif_chain_id (str) – atom_site.label_asym_id
- Returns:
atom_site.label_entity_id as str (empty if no mapping)
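The translation getters return an empty string when no mapping exists, so a lookup table over several chains has to treat that case explicitly. The sketch below is illustrative only: chain_name_table is a hypothetical helper, and FakeInfo is a dictionary-backed stand-in for MMCifInfo that mimics the documented Add/Get methods so the code runs without OpenStructure.

```python
def chain_name_table(info, cif_chain_names):
    """Map each mmCIF chain name to (author chain name, entity ID).

    Empty strings returned by the getters (no mapping) become None.
    """
    table = {}
    for name in cif_chain_names:
        auth = info.GetMMCifPDBChainTr(name) or None
        entity = info.GetMMCifEntityIdTr(name) or None
        table[name] = (auth, entity)
    return table


class FakeInfo:
    """Stand-in for MMCifInfo backed by plain dictionaries (illustration only)."""

    def __init__(self):
        self._auth = {}
        self._entity = {}

    def AddMMCifPDBChainTr(self, cif_chain_id, pdb_chain_id):
        self._auth[cif_chain_id] = pdb_chain_id

    def AddMMCifEntityIdTr(self, cif_chain_id, entity_id):
        self._entity[cif_chain_id] = entity_id

    def GetMMCifPDBChainTr(self, cif_chain_id):
        return self._auth.get(cif_chain_id, "")

    def GetMMCifEntityIdTr(self, cif_chain_id):
        return self._entity.get(cif_chain_id, "")


info = FakeInfo()
info.AddMMCifPDBChainTr("A", "H")  # polymer chain
info.AddMMCifPDBChainTr("C", "H")  # ligand kept in the same author chain
info.AddMMCifEntityIdTr("A", "1")
info.AddMMCifEntityIdTr("C", "2")
table = chain_name_table(info, ["A", "C", "D"])
print(table)
```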
- GetEntityIdsOfType(type)¶
Get list of entity ids for which MMCifEntityDesc.entity_type equals type
- Parameters:
type (str) – Selection criteria of returned entity ids
- Returns:
list of str representing selected entity ids
- AddRevision(num, date, status, major=-1, minor=-1)¶
Add a new iteration to the revision history. See MMCifInfoRevisions.AddRevision().
- SetRevisionsDateOriginal(date)¶
Set the date when this entry first entered the PDB. Ignored if the date was already set before. See MMCifInfoRevisions.SetDateOriginal().
- GetEntityBranchLinks()¶
Get bond information for branched entities. Returns all MMCifInfoEntityBranchLink objects in one list. Chain and residue information is available via the stored AtomHandles of each entry.
- Returns:
list of MMCifInfoEntityBranchLink
- GetEntityBranchByChain(chain_name)¶
Get bond information for chains with branched entities. Returns all MMCifInfoEntityBranchLink objects in one list if chain is a branched entity, an empty list otherwise.
- Parameters:
chain_name (str) – Chain name to check for branch links
- Returns:
list of MMCifInfoEntityBranchLink
- AddEntityBranchLink(chain_name, atom1, atom2, bond_order)¶
Add bond information for a branched entity.
- Parameters:
chain_name (str) – Chain the bond belongs to
atom1 (AtomHandle) – First atom of the bond
atom2 (AtomHandle) – Second atom of the bond
bond_order (int) – Bond order (e.g. 1=single, 2=double, 3=triple)
- Returns:
Nothing
- GetEntityBranchChainNames()¶
Get a list of chain names which contain branched entities.
- Returns:
list of str
- GetEntityBranchChains()¶
Get a list of chains which contain branched entities.
- Returns:
list of ChainHandle
- ConnectBranchLinks()¶
Establish all bonds stored for branched entities.
- GetEntityDesc(entity_id)¶
Get info of type MMCifEntityDesc for specified entity_id. The entity id for a chain can be fetched with GetMMCifEntityIdTr().
- Parameters:
entity_id (str) – ID of entity
- class MMCifInfoCitation¶
This stores citation information from an input file.
- id¶
Stores an internal identifier for a citation. If not provided, it is an empty string.
Also available as GetID(). May also be modified by SetID().
- cas¶
Stores a Chemical Abstract Service identifier if available. If not provided, it is an empty string.
Also available as GetCAS(). May also be modified by SetCas().
- isbn¶
Stores the ISBN code, presumably for cited books. If not provided, it is an empty string.
Also available as GetISBN(). May also be modified by SetISBN().
- published_in¶
Stores the book or journal title of a publication. Should take the full title, no abbreviations. If not provided, it is an empty string.
Also available as GetPublishedIn(). May also be modified by SetPublishedIn().
- volume¶
Supposed to store volume information for journals. Since the volume number is not always a simple integer, it is stored as a string. If not provided, it is an empty string.
Also available as GetVolume(). May also be modified by SetVolume().
- page_first¶
Stores the first page of a publication. Since page numbers are not always simple integers, they are stored as strings. If not provided, it is an empty string.
Also available as GetPageFirst(). May also be modified by SetPageFirst().
- page_last¶
Stores the last page of a publication. Since page numbers are not always simple integers, they are stored as strings. If not provided, it is an empty string.
Also available as GetPageLast(). May also be modified by SetPageLast().
- doi¶
Stores the Document Object Identifier as used by doi.org for a cited document. If not provided, it is an empty string.
Also available as GetDOI(). May also be modified by SetDOI().
- pubmed¶
Stores the PubMed accession number. If not provided, is set to 0.
Also available as GetPubMed(). May also be modified by SetPubmed().
- year¶
Stores the publication year. If not provided, is set to 0.
Also available as GetYear(). May also be modified by SetYear().
- title¶
Stores a title. If not provided, is set to an empty string.
Also available as GetTitle(). May also be modified by SetTitle().
- book_publisher¶
Name of publisher of the citation, relevant for books and book chapters.
Also available as GetBookPublisher() and SetBookPublisher().
- book_publisher_city¶
City of the publisher of the citation, relevant for books and book chapters.
Also available as GetBookPublisherCity() and SetBookPublisherCity().
- citation_type¶
Defines where a citation was published. Either journal, book or unknown.
Also available as GetCitationType(). May also be modified by SetCitationType() with values from MMCifInfoCType. For convenience, the setters SetCitationTypeJournal(), SetCitationTypeBook() and SetCitationTypeUnknown() exist.
For checking the type of a citation, IsCitationTypeJournal(), IsCitationTypeBook() and IsCitationTypeUnknown() can be used.
- authors¶
Stores a StringList of authors.
Also available as GetAuthorList(). May also be modified by SetAuthorList().
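The citation attributes use empty strings and 0 as "not provided" markers, which matters when assembling a printable reference. The following sketch shows one way to do that. format_citation is a hypothetical helper using only the attributes documented above, and the example object is a SimpleNamespace stand-in filled with made-up placeholder data, not a real publication.

```python
from types import SimpleNamespace


def format_citation(cit):
    """Build a one-line reference from an MMCifInfoCitation-like object."""
    authors = ", ".join(cit.authors) or "Unknown authors"
    # year == 0 means "not provided"
    parts = [f"{authors} ({cit.year})." if cit.year else f"{authors}."]
    if cit.title:
        parts.append(f"{cit.title}.")
    if cit.published_in:
        source = cit.published_in
        if cit.volume:
            source += f" {cit.volume}"
        if cit.page_first:
            pages = cit.page_first
            if cit.page_last:
                pages += f"-{cit.page_last}"
            source += f", {pages}"
        parts.append(f"{source}.")
    if cit.doi:
        parts.append(f"doi:{cit.doi}")
    return " ".join(parts)


# Placeholder data for illustration only.
example = SimpleNamespace(authors=["Doe J", "Roe R"], year=2024,
                          title="Example title",
                          published_in="Journal of Examples",
                          volume="12", page_first="34", page_last="56",
                          doi="")
reference = format_citation(example)
print(reference)
```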
- GetPublishedIn()¶
See
published_in
- SetPublishedIn(title)¶
See
published_in
- GetPageFirst()¶
See
page_first
- SetPageFirst(first)¶
See
page_first
- GetBookPublisher()¶
See
book_publisher
- SetBookPublisher()¶
See
book_publisher
- GetBookPublisherCity()¶
- SetBookPublisherCity()¶
- GetCitationType()¶
See
citation_type
- SetCitationType(publication_type)¶
See
citation_type
- SetCitationTypeJournal()¶
See
citation_type
- SetCitationTypeBook()¶
See
citation_type
- SetCitationTypeUnknown()¶
See
citation_type
- IsCitationTypeJournal()¶
See
citation_type
- IsCitationTypeBook()¶
See
citation_type
- IsCitationTypeUnknown()¶
See
citation_type
- class MMCifInfoTransOp¶
This stores operations needed to transform an EntityHandle into a bio unit.
- id¶
A unique identifier. If not provided, it is an empty string.
- type¶
Describes the operation. If not provided, it is an empty string.
Also available as GetType(). May also be modified by SetType().
- translation¶
The translational vector. Also available as GetVector(). May also be modified by SetVector().
- rotation¶
The rotational matrix. Also available as GetMatrix(). May also be modified by SetMatrix().
- GetVector()¶
See
translation
- SetVector(x, y, z)¶
See
translation
- class MMCifInfoBioUnit¶
This stores information on how a structure is to be assembled to form the bio unit.
- id¶
The id of a bio unit as given by the original mmCIF file.
Also available as GetID(). May also be modified by SetID().
- Type:
str
- details¶
Special aspects of the biological assembly. If not provided, it is an empty string.
Also available as GetDetails(). May also be modified by SetDetails().
- method_details¶
Details about the method used to determine this biological assembly.
Also available as GetMethodDetails(). May also be modified by SetMethodDetails().
- chains¶
Chains involved in this bio unit. If not provided, it is an empty list.
Also available as GetChainList(). May also be modified by AddChain() or SetChainList().
- chainintervals¶
List of intervals on the chain list. Needed if there are several sets of chains and transformations to create the bio unit. Comes as a list of tuples. First component is the start, second is the right border of the interval.
Also available as GetChainIntervalList(). Is automatically modified by AddChain(), SetChainList() and MMCifInfo.AddBioUnit().
- operations¶
Translations and rotations needed to create the bio unit. Filled with objects of class MMCifInfoTransOp.
Also available as GetOperations(). May be modified by AddOperations().
- operationsintervalls¶
List of intervals on the operations list. Needed if there are several sets of chains and transformations to create the bio unit. Comes as a list of tuples. First component is the start, second is the right border of the interval.
Also available as GetOperationsIntervalList(). Is automatically modified by AddOperations() and MMCifInfo.AddBioUnit().
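The two interval lists pair up: the i-th chain interval selects the chains that the i-th operations interval is applied to. The sketch below illustrates that pairing on plain Python lists. It is not the OpenStructure implementation: assembly_recipe is a hypothetical helper, the operation names are placeholders, and the right border is assumed to be exclusive (Python slice style), which the text above does not state.

```python
def assembly_recipe(chains, chain_intervals, operations, operation_intervals):
    """Pair each set of chains with the set of operations applied to it.

    Assumption: intervals are (start, end) tuples with an exclusive right
    border, so they can be used directly as Python slices.
    """
    if len(chain_intervals) != len(operation_intervals):
        raise ValueError("interval lists must have the same length")
    recipe = []
    for (c_start, c_end), (o_start, o_end) in zip(chain_intervals,
                                                   operation_intervals):
        recipe.append((chains[c_start:c_end], operations[o_start:o_end]))
    return recipe


# Two sets: chains A and B use op1, chain C uses op2 and op3.
recipe = assembly_recipe(
    chains=["A", "B", "C"],
    chain_intervals=[(0, 2), (2, 3)],
    operations=["op1", "op2", "op3"],
    operation_intervals=[(0, 1), (1, 3)],
)
print(recipe)
```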
- GetMethodDetails()¶
See
method_details
- SetMethodDetails(details)¶
See
method_details
- SetChainList(chains)¶
See chains. Also resets chainintervalls to contain only one interval enclosing the whole chain list.
- Parameters:
chains (StringList) – List of chain names.
- AddChain(chain name)¶
See chains. Also extends the right border of the last entry in chainintervalls.
- GetChainIntervalList()¶
See
chainintervals
- GetOperations()¶
See
operations
- AddOperations(list of operations)¶
See operations. Also extends the right border of the last entry in operationsintervalls.
- GetOperationsIntervalList()¶
- PDBize(asu, seqres=None, min_polymer_size=None, transformation=False, peptide_min_size=10, nucleicacid_min_size=10, saccharide_min_size=10)¶
Returns the biological assembly (bio unit) for an entity. The new entity created is well suited to be saved as a PDB file. Therefore the function tries to meet the requirements of single-character chain names. The following measures are taken:
All ligands are put into one chain (_)
Water is put into one chain (-)
Each polymer gets its own chain, named A-Z 0-9 a-z.
The description of non-polymer chains will be put into a generic string property called description on the residue level.
Ligands that resemble a polymer but have fewer than min_polymer_size / peptide_min_size / nucleicacid_min_size / saccharide_min_size residues are assigned the same numeric residue number. The residues are distinguished by insertion code.
Sometimes bio units exceed the coordinate system storable in a PDB file. In that case, the box around the entity will be aligned to the lower left corner of the coordinate system.
Since this function is at the moment mainly used to create biounits from mmCIF files to be saved as PDBs, the function assumes that the ChainType properties are set correctly. For a more mmCIF-style way of doing things read this: Biounits
- Parameters:
asu (EntityHandle) – Asymmetric unit to work on. Should be created from a mmCIF file.
seqres (SequenceList) – If set to a valid sequence list, the length of the seqres records will be used to determine if a certain chain has the minimally required length.
min_polymer_size (int) – The minimal number of residues a polymer needs to get its own chain. Everything below that number will be sorted into the ligand chain. Overrides peptide_min_size, nucleicacid_min_size and saccharide_min_size if set to a value different from None.
transformation (bool) – If set, return the transformation matrix used to move the bounding box of the bio unit to the lower left corner.
peptide_min_size (int) – Minimal size to get an individual chain for a polypeptide. Is overridden by min_polymer_size.
nucleicacid_min_size (int) – Minimal size to get an individual chain for a polynucleotide. Is overridden by min_polymer_size.
saccharide_min_size (int) – Minimal size to get an individual chain for an oligosaccharide or polysaccharide. Is overridden by min_polymer_size.
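The chain naming rules described for PDBize() can be illustrated with a small stand-alone sketch. This is not the OpenStructure implementation: assign_pdb_chain_names and the kind labels ("polymer", "non-polymer", "water") are hypothetical, and a single size threshold is used in place of the per-type minimum sizes.

```python
import string

# Chain names in the order given above: A-Z, then 0-9, then a-z.
POLYMER_CHAIN_NAMES = (string.ascii_uppercase + string.digits
                       + string.ascii_lowercase)
LIGAND_CHAIN = "_"
WATER_CHAIN = "-"


def assign_pdb_chain_names(chains, min_polymer_size=10):
    """Assign single-character chain names to (name, kind, n_residues) tuples.

    Polymers with at least min_polymer_size residues get their own chain,
    water goes to "-", everything else to the ligand chain "_".
    """
    names = iter(POLYMER_CHAIN_NAMES)
    assignment = {}
    for name, kind, n_residues in chains:
        if kind == "water":
            assignment[name] = WATER_CHAIN
        elif kind == "polymer" and n_residues >= min_polymer_size:
            try:
                assignment[name] = next(names)
            except StopIteration:
                raise ValueError("ran out of single-character chain names") from None
        else:
            assignment[name] = LIGAND_CHAIN
    return assignment


assignment = assign_pdb_chain_names([
    ("A", "polymer", 120),
    ("B", "polymer", 4),       # too short, treated as ligand
    ("C", "non-polymer", 1),
    ("D", "water", 57),
    ("E", "polymer", 80),
])
print(assignment)
```

The 62 available names (26 + 10 + 26) are the hard limit that single-character chain names impose on the number of polymer chains.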
- class MMCifInfoStructDetails¶
Holds details about the structure.
- entry_id¶
Identifier for a certain data block. If not provided, it is an empty string.
Also available as GetEntryID(). May also be modified by SetEntryID().
- title¶
Set a title for the structure.
Also available as GetTitle(). May also be modified by SetTitle().
- casp_flag¶
Tells whether this structure was a target in some competition.
Also available as GetCASPFlag(). May also be modified by SetCASPFlag().
- descriptor¶
Descriptor for an NDB structure or the unstructured content of a PDB COMPND record.
Also available as GetDescriptor(). May also be modified by SetDescriptor().
- mass_method¶
Method used to determine the molecular weight.
Also available as GetMassMethod(). May also be modified by SetMassMethod().
- model_details¶
Details about how the structure was determined.
Also available as GetModelDetails(). May also be modified by SetModelDetails().
- model_type_details¶
Details about how the type of the structure was determined.
Also available as GetModelTypeDetails(). May also be modified by SetModelTypeDetails().
- GetDescriptor()¶
See
descriptor
- SetDescriptor(descriptor)¶
See
descriptor
- GetMassMethod()¶
See
mass_method
- SetMassMethod(method)¶
See
mass_method
- GetModelDetails()¶
See
model_details
- SetModelDetails(details)¶
See
model_details
- GetModelTypeDetails()¶
- SetModelTypeDetails(details)¶
- class MMCifInfoObsolete¶
Holds details on obsolete / superseded structures. The data is available both in the obsolete and in the replacement entries.
- id¶
Type of change. Either Obsolete or Supersede. Returns a string starting with an upper-case letter. Has to be set via OBSLTE or SPRSDE.
- pdb_id¶
ID of the replacing entry.
Also available as GetPDBID(). May also be modified by SetPDBID().
- replace_pdb_id¶
ID of the replaced entry.
Also available as GetReplacedPDBID(). May also be modified by SetReplacedPDBID().
- GetReplacedPDBID()¶
See
replace_pdb_id
- SetReplacedPDBID(descriptor)¶
See
replace_pdb_id
- class MMCifInfoStructRef¶
Holds the information of the struct_ref category. The category describes the link of polymers in the mmCIF file to sequences stored in external databases such as UniProt. The related categories struct_ref_seq and struct_ref_seq_dif also list differences between the sequences of the deposited structure and the sequences in the database. Two prominent examples of such differences are point mutations and/or expression tags.
- db_name¶
Name of the external database, for example UNP for UniProt.
- Type:
str
- db_access¶
Alternative accession code for the sequence in the database pointed to by db_name.
- Type:
str
- aligned_seqs¶
List of aligned sequences (all entries of the struct_ref_seq category mapping to this struct_ref) as MMCifInfoStructRefSeq.
Also available as GetAlignedSeqs().
- GetAlignedSeq(name)¶
Returns the aligned sequence for the given name, None if the sequence does not exist.
- GetAlignedSeqs()¶
See
aligned_seqs
- class MMCifInfoStructRefSeq¶
An aligned range of residues between a sequence in a reference database and the deposited sequence.
- align_id¶
Uniquely identifies every struct_ref_seq item in the mmCIF file.
- Type:
str
- seq_begin¶
- seq_end¶
The starting point (1-based) and end point of the aligned range in the deposited sequence, respectively.
- Type:
int
- db_begin¶
- db_end¶
The starting point (1-based) and end point of the aligned range in the database sequence, respectively.
- Type:
int
- difs¶
List of differences (
MMCifInfoStructRefSeqDif
) between the deposited sequence and the sequence in the database.
- chain_name¶
Chain name of the polymer in the mmCIF file.
- class MMCifInfoStructRefSeqDif¶
A particular difference between the deposited sequence and the sequence in the database.
- seq_rnum¶
The residue number (1-based) of the residue in the deposited sequence
- Type:
int
- db_rnum¶
The number of the residue in the database sequence or ‘?’ if ‘struct_ref_seq_dif.pdbx_seq_db_seq_num’ was missing.
- Type:
str
- details¶
A textual description of the difference, e.g. point mutation, expression tag, purification artifact.
- Type:
str
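The three classes above nest: each MMCifInfoStructRef holds aligned sequences, and each aligned sequence holds its differences. A minimal sketch of walking that hierarchy follows. list_sequence_differences is a hypothetical helper that uses only the attributes documented above, and the SimpleNamespace objects are stand-ins with placeholder values (the accession code is not a real entry).

```python
from types import SimpleNamespace


def list_sequence_differences(struct_refs):
    """Flatten struct_ref -> aligned_seqs -> difs into one row per difference."""
    rows = []
    for ref in struct_refs:
        for aln in ref.aligned_seqs:
            for dif in aln.difs:
                rows.append((ref.db_name, ref.db_access, aln.chain_name,
                             dif.seq_rnum, dif.db_rnum, dif.details))
    return rows


# Stand-in data: one reference, one aligned range, two differences.
dif_tag = SimpleNamespace(seq_rnum=1, db_rnum="?", details="expression tag")
dif_mut = SimpleNamespace(seq_rnum=42, db_rnum="40", details="point mutation")
aln = SimpleNamespace(align_id="1", chain_name="A", seq_begin=1, seq_end=150,
                      db_begin=1, db_end=148, difs=[dif_tag, dif_mut])
ref = SimpleNamespace(db_name="UNP", db_access="X00000", aligned_seqs=[aln])

rows = list_sequence_differences([ref])
for row in rows:
    print(row)
```

Note that db_rnum stays a string because it can be ‘?’ when the database residue number is missing.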
- class MMCifInfoRevisions¶
Revision history of a PDB entry. If you find a ‘?’ somewhere, this means ‘not set’.
- date_original¶
The date when this entry was seen in PDB for the very first time. This is not necessarily the release date. Expected format ‘yyyy-mm-dd’.
- Type:
str
- first_release¶
Index + 1 of the revision releasing this entry. A value of 0 means it has not been set yet. It is set the first time we encounter a GetStatus() value of “full release” (mmCIF versions < 5) or “Initial release” (current mmCIF).
- Type:
int
- AddRevision(num, date, status, major=-1, minor=-1)¶
Add a new iteration to the history.
- Parameters:
num (int) – See GetNum()
date (str) – See GetDate()
status (str) – See GetStatus()
major (int) – See GetMajor()
minor (int) – See GetMinor()
- Raises:
Exception if num is <= the last added iteration.
- GetSize()¶
- Returns:
Number of revisions (valid revision indices are in [0, number-1]).
- Return type:
int
- GetDate(i)¶
- Parameters:
i (int) – Index of revision
- Returns:
Date the PDB revision took place. Expected format ‘yyyy-mm-dd’.
- Return type:
str
- Raises:
Exception if i is out of bounds.
- GetNum(i)¶
- Parameters:
i (int) – Index of revision
- Returns:
Unique identifier of revision (assigned in increasing order)
- Return type:
int
- Raises:
Exception if i is out of bounds.
- GetStatus(i)¶
- Parameters:
i (int) – Index of revision
- Returns:
The status of this revision.
- Return type:
str
- Raises:
Exception if i is out of bounds.
- GetMajor(i)¶
- Parameters:
i (int) – Index of revision
- Returns:
The major version of this revision (-1 if not set).
- Return type:
int
- Raises:
Exception if i is out of bounds.
- GetMinor(i)¶
- Parameters:
i (int) – Index of revision
- Returns:
The minor version of this revision (-1 if not set).
- Return type:
int
- Raises:
Exception if i is out of bounds.
- GetLastDate()¶
- Returns:
Date of the latest revision (‘?’ if no revision set).
- Return type:
str
- GetLastMajor()¶
- Returns:
Major version of the latest revision (-1 if not set).
- Return type:
int
- GetLastMinor()¶
- Returns:
Minor version of the latest revision (-1 if not set).
- Return type:
int
- SetDateOriginal(date)¶
- GetDateOriginal()¶
See
date_original
- GetFirstRelease()¶
See
first_release
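Because first_release stores "index + 1" with 0 meaning "not set", looking up the release date needs an off-by-one step. The sketch below shows this together with a simple iteration over the history. revision_rows and release_date are hypothetical helpers, and FakeRevisions is a stand-in that mimics the documented methods with made-up dates so the snippet runs without OpenStructure.

```python
def revision_rows(revisions):
    """Return (num, date, status, major, minor) for every revision."""
    return [(revisions.GetNum(i), revisions.GetDate(i), revisions.GetStatus(i),
             revisions.GetMajor(i), revisions.GetMinor(i))
            for i in range(revisions.GetSize())]


def release_date(revisions):
    """Date of the revision that released the entry, or None if unset."""
    first = revisions.GetFirstRelease()  # index + 1, 0 means "not set"
    return revisions.GetDate(first - 1) if first > 0 else None


class FakeRevisions:
    """Stand-in mimicking the documented behaviour (illustration only)."""

    def __init__(self):
        self._rows = []
        self._first_release = 0

    def AddRevision(self, num, date, status, major=-1, minor=-1):
        if self._rows and num <= self._rows[-1][0]:
            raise ValueError("num must exceed the last added revision")
        self._rows.append((num, date, status, major, minor))
        if self._first_release == 0 and status in ("full release",
                                                   "Initial release"):
            self._first_release = len(self._rows)

    def GetSize(self):
        return len(self._rows)

    def GetNum(self, i):
        return self._rows[i][0]

    def GetDate(self, i):
        return self._rows[i][1]

    def GetStatus(self, i):
        return self._rows[i][2]

    def GetMajor(self, i):
        return self._rows[i][3]

    def GetMinor(self, i):
        return self._rows[i][4]

    def GetFirstRelease(self):
        return self._first_release


history = FakeRevisions()
history.AddRevision(1, "2020-01-15", "Initial release", 1, 0)
history.AddRevision(2, "2021-03-02", "Database references", 1, 1)
print(revision_rows(history))
print(release_date(history))
```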
- class MMCifInfoEntityBranchLink¶
Data from pdbx_entity_branch, most specifically pdbx_entity_branch_link. That is connectivity information for branched entities, e.g. carbohydrates/oligosaccharides. Conop Processors cannot easily connect them, so we use this information in LoadMMCIF() to do that.
- atom1¶
The first atom of the bond. Corresponds to entity_branch_link.atom_id_1, entity_branch_link.comp_id_1 and entity_branch_link.entity_branch_list_num_1. Also available via GetAtom1() and SetAtom1().
- Type:
AtomHandle
- atom2¶
The second atom of the bond. Corresponds to entity_branch_link.atom_id_2, entity_branch_link.comp_id_2 and entity_branch_link.entity_branch_list_num_2. Also available via GetAtom2() and SetAtom2().
- Type:
AtomHandle
- bond_order¶
Order of a bond (e.g. 1=single, 2=double, 3=triple). Corresponds to entity_branch_link.value_order. Also available via GetBondOrder() and SetBondOrder().
- Type:
int
- ConnectBranchLink(editor)¶
Establish a bond between atom1 and atom2 of a MMCifInfoEntityBranchLink.
- Parameters:
editor (XCSEditor) – The editor instance to call for connecting the atoms.
- Returns:
Nothing
- GetBondOrder()¶
See
bond_order
- SetBondOrder()¶
See
bond_order
- class MMCifEntityDesc¶
Data collected for a certain mmCIF entity
- type¶
The ost chain type which can be assigned to ost.mol.ChainHandle
- Type:
ChainType
- entity_type¶
value of _entity.type token
- Type:
str
- entity_poly_type¶
value of _entity_poly.type token - empty string if entity is not of type “polymer”
- Type:
str
- branched_type¶
value of _pdbx_entity_branch.type token - empty string if entity is not of type “branched”
- Type:
str
- details¶
value of _entity.pdbx_description token
- Type:
str
- seqres_canonical¶
Canonical SEQRES - empty string if entity is not of type “polymer”. This contains the canonical sequence extracted from the _entity_poly.pdbx_seq_one_letter_code_can data item.
- Type:
str
- seqres_pdbx¶
PDBx SEQRES - empty string if entity is not of type “polymer”. This contains the sequence extracted from the _entity_poly.pdbx_seq_one_letter_code data item. Modifications and non-standard amino acids are represented by their three letter code in brackets, e.g. “(MSE)”
- Type:
str
- mon_ids¶
Monomer ids of all residues in a polymer - empty if entity is not of type “polymer”. Read from _entity_poly_seq category. If a residue is heterogeneous, this list contains the monomer id that comes first in the CIF file. The other variants end up in hetero_num / hetero_ids.
- Type:
ost.base.StringList
- hetero_num¶
Residue numbers of heterogeneous compounds - empty if entity is not of type “polymer”. Read from _entity_poly_seq category. If a residue is heterogeneous, the monomer id that comes first in the CIF file ends up in mon_ids. The remaining variants are listed here. This list specifies the residue numbers for the respective monomer ids in hetero_ids.
- hetero_ids¶
Monomer ids of heterogeneous compounds - empty if entity is not of type “polymer”. Read from _entity_poly_seq category. If a residue is heterogeneous, the monomer id that comes first in the CIF file ends up in mon_ids. The remaining variants are listed here. This list specifies the monomer ids for the respective locations in hetero_num.
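The three lists mon_ids, hetero_num and hetero_ids together describe all variants at heterogeneous positions. The following sketch combines them using plain Python lists. monomer_variants is a hypothetical helper, and it assumes that the residue numbers in hetero_num are 1-based positions into mon_ids, which the description above suggests but does not state explicitly.

```python
def monomer_variants(mon_ids, hetero_num, hetero_ids):
    """Return {residue number: [monomer ids]} for heterogeneous positions.

    The first entry of each list is the monomer id stored in mon_ids,
    followed by the additional variants from hetero_ids.
    Assumption: residue numbers are 1-based positions into mon_ids.
    """
    variants = {}
    for num, het_id in zip(hetero_num, hetero_ids):
        first = mon_ids[num - 1]
        variants.setdefault(num, [first]).append(het_id)
    return variants


# Position 3 was modelled as SER, with THR and CYS as further variants.
variants = monomer_variants(
    mon_ids=["MET", "ALA", "SER", "GLY"],
    hetero_num=[3, 3],
    hetero_ids=["THR", "CYS"],
)
print(variants)
```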
Writing mmCIF files¶
Star Writer¶
The syntax of mmCIF is a subset of the CIF file syntax, which is itself a
subset of the STAR file syntax. OpenStructure implements a simple StarWriter
that is able to write data in two ways:
key-value: A category name and an attribute name that is linked to a value. Example:
_citation.year 2024
_citation.year is called a mmCIF token. It consists of a data category (_citation) and a data item (year), delimited by a “.”.
tabular: Represents several values for a mmCIF token. The tokens are written in a header which is followed by the respective values. Example:
loop_
_atom_site.group_PDB
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_entity_id
_atom_site.label_seq_id
_atom_site.label_alt_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.auth_seq_id
_atom_site.auth_asym_id
_atom_site.id
_atom_site.pdbx_PDB_ins_code
ATOM N N SER A 0 1 . -47.333 0.941 8.834 1.00 52.56 71 P 0 ?
ATOM C CA SER A 0 1 . -45.849 0.731 8.796 1.00 53.56 71 P 1 ?
ATOM C C SER A 0 1 . -45.191 1.608 7.714 1.00 51.61 71 P 2 ?
...
What follows is an example of how to use the StarWriter and its
associated objects. In principle that's enough to write a full mmCIF file,
but you definitely want to check out the MMCifWriter, which extends
StarWriter and extracts the relevant data from an OpenStructure
ost.mol.EntityHandle.
from ost import io
import math

writer = io.StarWriter()

# Add key value pair
value = io.StarWriterValue.FromInt(42)
data_item = io.StarWriterDataItem("_the", "answer", value)
writer.Push(data_item)

# Add tabular data: describe the columns first, then add one row per number
loop_desc = io.StarWriterLoopDesc("_math_oper")
loop_desc.Add("num")
loop_desc.Add("sqrt")
loop_desc.Add("square")
loop = io.StarWriterLoop(loop_desc)
for i in range(10):
    data = list()
    data.append(io.StarWriterValue.FromInt(i))
    # square root written with 3 decimals
    data.append(io.StarWriterValue.FromFloat(math.sqrt(i), 3))
    data.append(io.StarWriterValue.FromInt(i*i))
    loop.AddData(data)
writer.Push(loop)

# Write this groundbreaking data into a file with name numbers.gz
# and yes, it's directly gzipped
writer.Write("numbers", "numbers.gz")
The content of the file written:
data_numbers
_the.answer 42
#
loop_
_math_oper.num
_math_oper.sqrt
_math_oper.square
0 0.000 0
1 1.000 1
2 1.414 4
3 1.732 9
4 2.000 16
5 2.236 25
6 2.449 36
7 2.646 49
8 2.828 64
9 3.000 81
#
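Because the file is gzip-compressed, it cannot be inspected with a plain text editor. The following is a minimal sketch, using only the Python standard library, of how such a file could be read back as text. The helper name read_star_text is hypothetical and not part of OpenStructure.

```python
import gzip


def read_star_text(path):
    """Return the content of a STAR/mmCIF file as a string.

    Hypothetical helper: files with a .gz suffix are decompressed on the fly,
    all other files are read as plain text.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return handle.read()


# Usage, assuming numbers.gz was written as shown above:
# print(read_star_text("numbers.gz"))
```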
- class StarWriterValue¶
A value which is stored as a string. It must be constructed with one of the static constructor functions.
- FromInt(int_val)¶
Static constructor from an integer value
- Parameters:
int_val (int) – The value
- Returns:
StarWriterValue
- FromFloat(float_val, decimals)¶
Static constructor from a float value
- Parameters:
float_val (float) – The value
decimals – Number of decimals that get stored in the internal value
- Returns:
StarWriterValue
- FromString(string_val)¶
Static constructor from a string value, stores input as is with the exception of the following processing:
set to “?” if string_val is an empty string (in mmCIF, “?” marks “unknown” values)
encapsulate string in quotes if string_val contains a space character
encapsulate string in quotes if string_val starts with any of the following special characters: _, #, $, ', ", [, ], ;
encapsulate string in quotes if string_val starts with any of the following special strings: “data_” (case insensitive), “save_” (case insensitive)
encapsulate string in quotes if string_val is equal to any of the following reserved words (case insensitive): “loop_”, “stop_”, “global_”
- Parameters:
string_val (str) – The value
- Returns:
StarWriterValue
- GetValue()¶
Returns the internal string representation
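The processing rules listed for FromString() can be sketched in plain Python. This is an illustration of the documented rules only, not the OpenStructure implementation. The function names are hypothetical, and the sketch deliberately does not decide which quote character is used, since that is not stated above.

```python
# Characters and strings taken from the rules documented for FromString()
SPECIAL_START_CHARS = ("_", "#", "$", "'", '"', "[", "]", ";")
SPECIAL_START_STRINGS = ("data_", "save_")      # compared case insensitively
RESERVED_WORDS = ("loop_", "stop_", "global_")  # compared case insensitively


def needs_quotes(string_val):
    """Return True if the documented rules ask for quoting of string_val."""
    if " " in string_val:
        return True
    if string_val.startswith(SPECIAL_START_CHARS):
        return True
    lowered = string_val.lower()
    if lowered.startswith(SPECIAL_START_STRINGS):
        return True
    return lowered in RESERVED_WORDS


def classify(string_val):
    """Hypothetical helper: 'unknown', 'quoted' or 'as-is'."""
    if string_val == "":
        return "unknown"  # would be stored as "?"
    return "quoted" if needs_quotes(string_val) else "as-is"
```

Note that only exact matches of the reserved words trigger quoting in this sketch, so a value such as loop_x is classified as 'as-is'.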
- class StarWriterDataItem(category, attribute, value)¶
key-value data representation
- Parameters:
category (str) – The category name of the data item
attribute (str) – The attribute name of the data item
value (StarWriterValue) – The value of the data item
- GetCategory()¶
Returns category
- GetAttribute()¶
Returns attribute
- GetValue()¶
Returns value
- class StarWriterLoopDesc(category)¶
Defines header for tabular data representation for the specified category
- Parameters:
category (
str
) – The category
- GetCategory()¶
Returns category
- GetSize()¶
Returns number of added attributes
- Add(attribute)¶
Adds an attribute
- Parameters:
attribute (
str
) – The attribute
- GetIndex(attribute)¶
Returns index for specified attribute, -1 if not found
- Parameters:
attribute (
str
) – The attribute for which the index should be returned
- class StarWriterLoop(desc)¶
Allows populating a StarWriterLoopDesc with data to get a full tabular data representation
- Parameters:
desc (StarWriterLoopDesc) – The header
- GetDesc()¶
Returns desc
- GetN()¶
Returns number of added data lists
- AddData(data_list)¶
Add data for each attribute in desc.
- Parameters:
data_list (list of StarWriterValue) – Data to be added, length must match the number of attributes in desc
- class StarWriter¶
Can be populated with data which can then be written to a file.
- Push(star_writer_object)¶
Push data to be written
- Parameters:
star_writer_object (
StarWriterDataItem
/StarWriterLoop
) – Data
- Write(data_name, filename)¶
Writes pushed data in specified file.
- Parameters:
data_name (str) – Name of data block, i.e. the written file starts with data_<data_name>.
filename (str) – Name of generated file - applies gzip compression in case of .gz suffix.
mmCIF Writer¶
Data categories considered by the OpenStructure mmCIF writer are described in the following. The listed attributes are written to fulfill all dependencies in a mmCIF file according to mmcif_pdbx_v50.
- atom_site
group_PDB
type_symbol
label_atom_id
label_asym_id
label_entity_id
label_seq_id
label_alt_id
Cartn_x
Cartn_y
Cartn_z
occupancy
B_iso_or_equiv
auth_seq_id
auth_asym_id
id
pdbx_PDB_ins_code
- entity
id
type
- struct_asym
id
entity_id
- entity_poly
entity_id
type
pdbx_seq_one_letter_code
pdbx_seq_one_letter_code_can
- entity_poly_seq
entity_id
mon_id
num
hetero
- pdbx_poly_seq_scheme
asym_id
entity_id
mon_id
seq_id
pdb_strand_id
pdb_seq_num
pdb_ins_code
- chem_comp
id
type
name
- atom_type
symbol
- pdbx_entity_branch
entity_id
type
The writer is designed to only require an OpenStructure
ost.mol.EntityHandle
/ ost.mol.EntityView
as input but
optionally performs preprocessing in order to separate residues of chains into
valid mmCIF entities. This is controlled by the mmcif_conform flag which has
significant impact on how chains are assigned to mmCIF entities, chain names and
residue numbers. Ideally, the input is mmcif_conform which is the case
when loading a structure from a valid mmCIF file with ost.io.LoadMMCIF()
.
Behaviour when mmcif_conform is True¶
Expected properties when mmcif_conform is enabled:
The residues in a chain all belong to the same mmCIF molecular entity. An example is a polypeptide chain in which all residues are peptide linking. In mmCIF lingo: an entity of type “polymer” which is of _entity_poly type “polypeptide(L)” with all residues being “L-PEPTIDE LINKING” (some glycines might be “PEPTIDE LINKING”). Another example is a ligand where the chain refers to an entity of type “non-polymer” and only contains that particular ligand.
Each chain must have a chain type assigned (available as ost.mol.ChainHandle.GetType()) which refers to the entity type. For entity types “polymer” and “branched”, the chain type also encodes the subtype. For a polymer chain, for example, the general CHAINTYPE_POLY is not sufficient and the more fine-grained polymer-specific type is expected, e.g. CHAINTYPE_POLY_PEPTIDE_D. The same is true for entities of type “branched”, where a subtype such as CHAINTYPE_OLIGOSACCHARIDE is expected.
The residue numbers in “polymer” chains must match the SEQRES of the underlying entity with 1-based indexing. Insertion codes are not allowed and raise an error.
Each residue must be named according to the entries in the ost.conop.CompoundLib which is provided when calling MMCifWriter.SetStructure(). This is relevant for the _chem_comp category. If the respective compound cannot be found, the type for that compound is set to “OTHER”.
There is one quirk remaining: The assignment of
underlying mmCIF entities. This is a challenge primarily for polymers. The
current logic starts with an empty internal entity list and successively
processes chains. If no match is found, a new entity gets generated and the
SEQRES is set to what we observe in the chain residues given their residue
numbers (i.e. the ATOMSEQ). If the first residue has residue number 10, the
SEQRES gets prefixed by 9 elements using a default value (e.g. UNK for a
chain of type CHAINTYPE_POLY_PEPTIDE_D). The same is done for gaps.
A chain is considered matching an mmCIF entity, if all of its residue names
are an exact match at the respective location in the SEQRES. Location is
determined with residue numbers which follow a 1-based indexing scheme.
However, there might be the case that one chain resolves
more residues than another. So you may have residues at locations that are
undefined in the current SEQRES. If the fraction of matches with undefined
locations does not exceed 5%, we still assume an overall match and fill
in the previously undefined locations in the SEQRES with the newly gained
information. This is a heuristic that works in most cases but potentially
introduces errors in entity assignment. If you want to avoid that, you
must set your entities manually and pass a MMCifWriterEntityList
when calling MMCifWriter.SetStructure()
. There is a dedicated
section on that below.
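The entity matching described above can be sketched in plain Python. This is a simplified illustration, not the OpenStructure implementation. In particular, the text does not specify how the 5% fraction is computed or how undefined locations are tracked. The sketch assumes that undefined locations are stored as None and that the fraction is the number of chain residues at undefined locations divided by the number of chain residues. All function names are hypothetical.

```python
def observed_seqres(residues):
    """Build a SEQRES-like list from {1-based residue number: compound name}.

    Locations without an observed residue are None, i.e. undefined.
    """
    return [residues.get(num) for num in range(1, max(residues) + 1)]


def written_seqres(seqres, default="UNK"):
    """Replace undefined locations by a default compound name."""
    return [default if mon_id is None else mon_id for mon_id in seqres]


def chain_matches(seqres, residues, max_undefined_fraction=0.05):
    """Check whether a chain matches an entity SEQRES.

    Assumption of this sketch: the fraction is the number of chain residues
    at undefined locations divided by the number of chain residues.
    """
    n_undefined = 0
    for num, name in residues.items():
        known = seqres[num - 1] if num <= len(seqres) else None
        if known is None:
            n_undefined += 1
        elif known != name:
            return False  # a defined location disagrees: no match
    return n_undefined / len(residues) <= max_undefined_fraction


def fill_undefined(seqres, residues):
    """Return a copy of seqres with undefined locations filled from the chain."""
    filled = list(seqres) + [None] * max(0, max(residues) - len(seqres))
    for num, name in residues.items():
        if filled[num - 1] is None:
            filled[num - 1] = name
    return filled
```

With a first chain {10: "ALA", 11: "GLY", 13: "SER"}, written_seqres(observed_seqres(...)) gives nine UNK entries followed by ALA, GLY, UNK, SER, which mirrors the prefixing and gap filling described above.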
If mmcif_conform is enabled, pretty much everything is in place and the previously listed mmCIF categories/attributes are written with a few special cases:
_atom_site.auth_asym_id: Honours the residue string property “pdb_auth_chain_name” if set, uses the actual chain name otherwise. The string property is set in the mmCIF reader.
_pdbx_poly_seq_scheme.pdb_strand_id: Same behaviour as _atom_site.auth_asym_id
_atom_site.auth_seq_id: Honours the residue string property “pdb_auth_resnum” if set, uses the actual residue number otherwise. The string property is set in the mmCIF reader.
_pdbx_poly_seq_scheme.pdb_seq_num: Same behaviour as _atom_site.auth_seq_id
_atom_site.pdbx_PDB_ins_code: Honours the residue string property “pdb_auth_ins_code” if set, uses the actual residue insertion code otherwise. The string property is set in the mmCIF reader. If mmcif_conform is enabled, the actual residue insertion code can be expected to be empty though.
_pdbx_poly_seq_scheme.pdb_ins_code: Same behaviour as _atom_site.pdbx_PDB_ins_code
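The fallback logic of these special cases can be sketched as follows. The residue string properties are modelled as a plain dict here, which is an assumption made to keep the example independent of OpenStructure. The function name auth_values is hypothetical.

```python
def auth_values(chain_name, resnum, ins_code, string_props):
    """Pick the author values for one residue.

    string_props: dict standing in for the string properties of a residue.
    Each property is honoured if set, otherwise the actual value is used.
    """
    return {
        "auth_asym_id": string_props.get("pdb_auth_chain_name", chain_name),
        "auth_seq_id": string_props.get("pdb_auth_resnum", str(resnum)),
        "pdbx_PDB_ins_code": string_props.get("pdb_auth_ins_code", ins_code),
    }


# A residue loaded by the mmCIF reader carries the properties ...
from_reader = auth_values("A", 1, "", {"pdb_auth_chain_name": "P",
                                       "pdb_auth_resnum": "71"})
# ... while a residue without properties falls back to its actual values
plain = auth_values("A", 1, "", {})
```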
Behaviour when mmcif_conform is False¶
If mmcif_conform is not enabled, the only expectation is that residues are
named according to the ost.conop.CompoundLib
which is provided when
calling MMCifWriter.SetStructure()
. The ost.conop.CompoundLib
is
used to extract the respective chem classes (see ost.mol.ChemClass
).
Residues with no entry in the ost.conop.CompoundLib
are set to
UNKNOWN
. There will be significant preprocessing involving the split of
chains which is purely based on these chem classes. Each chain gets split with
the following rules:
separate chain of _entity.type “non-polymer” for each residue with chem class NON_POLYMER/UNKNOWN
if any residue has chem class WATER, all of them are collected into one separate chain with _entity.type “water”
if any residue is a saccharide, i.e. has chem class SACCHARIDE/L_SACCHARIDE/D_SACCHARIDE, all of them are gathered into a single separate chain of _entity.type “branched” and _pdbx_entity_branch.type “oligosaccharide”.
if any residue has chem class RNA_LINKING, all of them are collected into one separate chain of _entity.type “polymer” and _entity_poly.type “polyribonucleotide”.
if any residue has chem class DNA_LINKING, all of them are collected into one separate chain of _entity.type “polymer” and _entity_poly.type “polydeoxyribonucleotide”.
if any residue is peptide linking, all of them are collected into one separate chain of _entity.type “polymer” and _entity_poly.type “polypeptide(L)”/“polypeptide(D)”. Only the following combinations of chem classes are allowed: either L_PEPTIDE_LINKING/PEPTIDE_LINKING or D_PEPTIDE_LINKING/PEPTIDE_LINKING. Mixing L_PEPTIDE_LINKING and D_PEPTIDE_LINKING raises an error.
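The splitting rules above can be sketched in plain Python. This is an illustration under stated assumptions, not the OpenStructure implementation: chem classes are given as plain strings instead of ost.mol.ChemClass values, the order of the returned chains is arbitrary, and a chain that only contains PEPTIDE_LINKING residues is treated as “polypeptide(L)”, which the text does not specify. The function name split_chain is hypothetical.

```python
SACCHARIDES = {"SACCHARIDE", "L_SACCHARIDE", "D_SACCHARIDE"}
PEPTIDES = {"PEPTIDE_LINKING", "L_PEPTIDE_LINKING", "D_PEPTIDE_LINKING"}


def split_chain(residues):
    """Split one chain into (entity_type, subtype, residue_names) tuples.

    residues: list of (residue_name, chem_class) tuples, chem classes given
    as plain strings for the purpose of this sketch.
    """
    singles = []  # one chain per NON_POLYMER/UNKNOWN residue
    water, sugars, rna, dna, peptide = [], [], [], [], []
    peptide_classes = set()
    for name, chem_class in residues:
        if chem_class in ("NON_POLYMER", "UNKNOWN"):
            singles.append(("non-polymer", "", [name]))
        elif chem_class == "WATER":
            water.append(name)
        elif chem_class in SACCHARIDES:
            sugars.append(name)
        elif chem_class == "RNA_LINKING":
            rna.append(name)
        elif chem_class == "DNA_LINKING":
            dna.append(name)
        elif chem_class in PEPTIDES:
            peptide.append(name)
            peptide_classes.add(chem_class)
        else:
            raise ValueError("chem class not covered by this sketch: " + chem_class)

    if {"L_PEPTIDE_LINKING", "D_PEPTIDE_LINKING"} <= peptide_classes:
        raise ValueError("mixing L_PEPTIDE_LINKING and D_PEPTIDE_LINKING")
    # Assumption: a chain with only PEPTIDE_LINKING residues is treated as L.
    peptide_type = ("polypeptide(D)" if "D_PEPTIDE_LINKING" in peptide_classes
                    else "polypeptide(L)")

    collected = [
        ("water", "", water),
        ("branched", "oligosaccharide", sugars),
        ("polymer", "polyribonucleotide", rna),
        ("polymer", "polydeoxyribonucleotide", dna),
        ("polymer", peptide_type, peptide),
    ]
    # Only keep the collected chains that actually contain residues
    return singles + [entry for entry in collected if entry[2]]
```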
Chain names are generated by iterating over
“ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijklmnopqrstuvwxyz”, starting with
AA, AB, AC etc. once the first cycle is through. There can therefore be as many
chains as needed. The mmCIF entities are built the same way as for
mmcif_conform with two differences: 1) the extracted SEQRES of a chain is the
ATOMSEQ, i.e. the exact sequence of its residues 2) entity matching happens
through exact matches of SEQRES and is independent from residue numbers. As a
consequence, the residue numbers written as _atom_site.label_seq_id
do not
correspond anymore to the actual residue numbers but refer to the location in
ATOMSEQ.
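The chain naming scheme can be sketched with a generator. This is a minimal illustration of the described iteration, not the OpenStructure implementation. The names CHAIN_NAME_CHARS and chain_names are hypothetical.

```python
import itertools

CHAIN_NAME_CHARS = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                    "0123456789"
                    "abcdefghijklmnopqrstuvwxyz")


def chain_names():
    """Yield chain names: single characters first, then AA, AB, AC etc."""
    for length in itertools.count(1):
        for combo in itertools.product(CHAIN_NAME_CHARS, repeat=length):
            yield "".join(combo)


# The first cycle covers 62 single-character names, the next name is AA
first_names = list(itertools.islice(chain_names(), 65))
```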
Once split and new chain names are assigned, the rest is straightforward.
The special cases listed above (_atom_site.auth_asym_id
,
_pdbx_poly_seq_scheme.pdb_strand_id, _atom_site.auth_seq_id etc.) are
treated the same as if mmcif_conform was true.
To see it all in action:
from ost import io
from ost import conop
ent = io.LoadMMCIF("1a0s", remote=True)
writer = io.MMCifWriter()
# The MMCifWriter is still an object of type StarWriter
# I can decorate my mmCIF file with any data I want
val = io.StarWriterValue.FromInt(42)
data_item = io.StarWriterDataItem("_the", "answer", val)
writer.Push(data_item)
# The actual relevant part... mmcif_conform can be set to
# True, as we loaded from mmCIF file
lib = conop.GetDefaultLib()
writer.SetStructure(ent, lib, mmcif_conform = True)
# And write...
writer.Write("1a0s", "1a0s.cif.gz")
Define mmCIF entities¶
The writer provides a way to pre-define mmCIF entities. This only works for polymer entities and only if mmcif_conform is enabled. The problem is that, with only a structure as input, there is no guarantee of ever seeing the full SEQRES (written in the entity_poly_seq category). As an example: gaps, i.e. residues that are missing based on residue numbers, are filled with UNK in case of an L_PEPTIDE_LINKING chain. In order to retain the full SEQRES information, we provide a way to define these polymer entities in the form of MMCifWriterEntity. The provided entities must fulfill:
They must be of _entity.type “polymer”
All chains in input structure that are of _entity.type “polymer” must be assigned to exactly one of these MMCifWriterEntity objects and must match the SEQRES (MMCifWriterEntity.mon_ids)
All chain names that are assigned to any of the MMCifWriterEntity objects must be present in input structure
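As a rough illustration, the requirements on chain assignment could be checked before handing the entities to the writer. The sketch below models entities as plain dicts with the keys "type" and "asym_ids", which is an assumption made to keep it independent of OpenStructure. It does not check the SEQRES match, and the function name check_entity_assignment is hypothetical.

```python
def check_entity_assignment(entities, polymer_chains, all_chains):
    """Return a list of human-readable problems, empty if none were found.

    entities: list of dicts with keys "type" and "asym_ids"
    polymer_chains: names of chains in the structure of entity type "polymer"
    all_chains: names of all chains in the structure
    """
    problems = []
    for idx, entity in enumerate(entities):
        if entity["type"] != "polymer":
            problems.append(f"entity {idx} is not of type polymer")
        for asym_id in entity["asym_ids"]:
            if asym_id not in all_chains:
                problems.append(f"chain {asym_id} is not in the structure")
    for chain in polymer_chains:
        # Each polymer chain must be assigned to exactly one entity
        n_assigned = sum(chain in entity["asym_ids"] for entity in entities)
        if n_assigned != 1:
            problems.append(f"chain {chain} is assigned to {n_assigned} entities")
    return problems
```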
Here is an example with pre-defined mmCIF entities:
from ost import io
from ost import conop
# Read the structure and also seqres and meta information
ent, seqres, info = io.LoadMMCIF("1a0s", remote=True,
                                 seqres=True, info=True)
# we need the compound library at several places
lib = conop.GetDefaultLib()
# pre-define mmCIF entities
entity_info = io.MMCifWriterEntityList()
for entity_id in info.GetEntityIdsOfType("polymer"):
    # Get entity description from info object
    entity_desc = info.GetEntityDesc(entity_id)
    # interface of entity_desc is similar to MMCifWriterEntity
    entity_poly_type = entity_desc.entity_poly_type
    mon_ids = entity_desc.mon_ids
    e = io.MMCifWriterEntity.FromPolymer(entity_poly_type,
                                         mon_ids, lib)
    entity_info.append(e)
    # search all chains assigned to the entity we just added
    for ch in ent.chains:
        if info.GetMMCifEntityIdTr(ch.name) == entity_id:
            entity_info[-1].asym_ids.append(ch.name)
    # deal with heterogeneities
    for a, b in zip(entity_desc.hetero_num, entity_desc.hetero_ids):
        entity_info[-1].AddHet(a, b)
# write mmcif file with pre-defined mmCIF entities
writer = io.MMCifWriter()
writer.SetStructure(ent, lib,
                    entity_info=entity_info)
writer.Write("1a0s", "1a0s.cif.gz")
- class MMCifWriterEntity¶
Defines a mmCIF entity which will be written by MMCifWriter. Must be created from a static constructor function.
- FromPolymer(entity_poly_type, mon_ids, compound_lib)¶
Static constructor function for entities of type “polymer”
- Parameters:
entity_poly_type (str) – Entity poly type from restricted vocabulary for _entity_poly.type
mon_ids (list of str) – Full names of all compounds defining the SEQRES of that entity
compound_lib (ost.conop.CompoundLib) – Components dictionary from which chem classes are fetched
- type¶
(str) The _entity.type
- poly_type¶
(str) The _entity_poly.type - empty string if type is not “polymer”
- branch_type¶
(str) The _pdbx_entity_branch.type - empty string if type is not “branched”
- mon_ids¶
(ost.StringList) The compound names making up this entity
- seq_olcs¶
(ost.StringList) The one letter codes for mon_ids which will be written to _pdbx_seq_one_letter_code - invalid if type is not “polymer”
- seq_can_olcs¶
(ost.StringList) The one letter codes for mon_ids which will be written to _pdbx_seq_one_letter_code_can - invalid if type is not “polymer”
- asym_ids¶
(ost.StringList) Asym chain names that are assigned to this entity
- class MMCifWriterEntityList¶
A list for
MMCifWriterEntity
- class MMCifWriter¶
Inherits all functionality from StarWriter and provides functionality to extract relevant mmCIF information from ost.mol.EntityHandle / ost.mol.EntityView
- SetStructure(ent, compound_lib, mmcif_conform=True, entity_info=list())¶
Extracts mmCIF categories/attributes based on the description above. An object of type MMCifWriter can only be associated with one structure. Calling this function more than once raises an error.
- Parameters:
ent (ost.mol.EntityHandle / ost.mol.EntityView) – The structure to write
compound_lib (ost.conop.CompoundLib) – The compound library
mmcif_conform (bool) – Determines data extraction strategy as described above
entity_info (MMCifWriterEntityList) – Predefine mmCIF entities - useful to define complete SEQRES. If given, the provided list serves as a starting point, i.e. chains in ent are matched to entities in entity_info. In case of no match, this list gets extended. Starts from empty list if not given.
- GetEntities()¶
Returns MMCifWriterEntityList. Useful to check after SetStructure() has been called. Order in this list defines entity ids in written mmCIF file with zero based indexing.
Biounits¶
Biological assemblies, i.e. biounits, are an integral part of mmCIF files and their construction is fully defined in MMCifInfoBioUnit. MMCifInfoBioUnit.PDBize() provides one possibility to construct such biounits with compatibility with the PDB format in mind, i.e. single-character chain names, dumping all ligands in one chain etc. For a more mmCIF-style way of constructing biounits, check out ost.mol.alg.CreateBU() in the ost.mol.alg module.