Using Binary Files In ProMod3
A few features in ProMod3 (and potentially your next addition) require binary files to be loaded and stored. Here, we provide guidelines and describe helper tools to perform tasks related to loading and storing binary files.
Generally, each binary file consists of a short header and binary data. The header ensures consistency between the storing and the loading of data, while the “binary data” is some binary representation of the data of interest.
The main issue we try to address is that in C++, the binary representation of objects can be machine- and compiler-dependent. The standard does guarantee that sizeof(char) == 1 and that std::vector is contiguous in memory. Everything else (e.g. sizeof(int), endianness, padding of structs) can vary. Two approaches can be used:
Raw binary data files which are very fast to load, but assume a certain memory-layout for the internal representation of data
Portable binary data files which are slow to load, but do not assume a given memory-layout for the internal representation of data
Portable I/O should always be provided for binary files. If this is too slow for your needs, you can provide functionality for raw binary files. In that case you should still distribute only the portable file and provide a converter which loads the portable file and stores a raw binary file for further use. Storing and loading of raw binary files on the same machine with the same compiler should never be an issue.
For instance, the classes TorsionSampler, FragDB, StructureDB, BBDepRotamerLib and RotamerLib use this approach and the conversion is automatically done in the make process. Code examples are given in the unit tests in test_check_io.cc and test_portable_binary.cc and in the C++ code of the classes listed above (see methods Load, Save, LoadPortable and SavePortable).
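To make the layout dependence concrete, here is a minimal standalone sketch (not part of ProMod3) that measures how many padding bytes the compiler adds to a struct with the same members as the SomeData struct used in the code example on this page. The names LayoutDemo, PackedSize and PaddingBytes are made up for illustration, and Real is assumed to be float.

```cpp
#include <cassert>
#include <cstddef>

// Same members as SomeData in the code example on this page.
// Assumption of this sketch: Real is a float.
struct LayoutDemo {
  short s;
  int i;
  float r;
};

// Size the members would occupy without any padding.
inline std::size_t PackedSize() {
  return sizeof(short) + sizeof(int) + sizeof(float);
}

// Number of padding bytes the compiler added to LayoutDemo.
inline std::size_t PaddingBytes() {
  assert(sizeof(LayoutDemo) >= PackedSize());
  return sizeof(LayoutDemo) - PackedSize();
}
```

On a typical x86-64 Linux build with GCC, sizeof(LayoutDemo) is 12 while the packed size is 10, so two padding bytes end up in a raw dump of the struct. Neither number is guaranteed by the standard, which is why raw files are only safe to reload with the same machine and compiler.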
File Header
The header is written/read with functions provided in the header file promod3/core/check_io.hh. The header is written/read before the data itself and is structured as follows:
a “magic number” (ensures that we can read uint32_t, which is needed for the following fields)
a version number (allows for backwards-compatibility)
sizes for all types which are treated as raw memory (i.e. cast to a byte (char) array and written either to memory or to a stream)
example values for the used base-types (ensures we can e.g. read an int)
For portable I/O (see below), we only write/read fixed-width fundamental data-types (e.g. int32_t, float). Hence, we only check if we can read/write those types.
When data is converted from a non-fixed fundamental type T (e.g. uint, short, Real), we furthermore ensure that the used fixed-width type (size written to file) is <= sizeof(T).
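The size condition above can be sketched as follows. This is a standalone illustration under the assumption that the size of the fixed-width type has already been read from the file as a uint32_t. CheckStoredSizeFits and StoredSizeFits are hypothetical names and not the functions provided in check_io.hh.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical helper (not the ProMod3 API): throws if a value stored
// with stored_size bytes would not fit into the in-memory type T.
template <typename T>
void CheckStoredSizeFits(uint32_t stored_size) {
  if (stored_size > sizeof(T)) {
    throw std::runtime_error("Stored type is larger than in-memory type.");
  }
}

// Same check, reported as a bool instead of an exception.
template <typename T>
bool StoredSizeFits(uint32_t stored_size) {
  try {
    CheckStoredSizeFits<T>(stored_size);
  } catch (const std::runtime_error&) {
    return false;
  }
  return true;
}
```

For example, a file written with 4-byte integers can be loaded into an int64_t (StoredSizeFits<int64_t>(4) is true) but not into an int16_t.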
All write functions (when saving a binary) should be mirrored by the corresponding check (or get) function in the exact same order when loading.
All functions are templatized to work with any OST-like data sink or source and overloaded to work with std::ofstream and std::ifstream.
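To illustrate the idea of a mirrored header, here is a simplified standalone sketch that writes a magic number, a version number, the size of int and an example int value, and then checks them in the same order. All names and constants (kMagicNumber, kExampleInt, WriteHeaderSketch, CheckHeaderSketch) are invented for this example; the real functions in check_io.hh may differ in layout and behavior.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>

// Made-up constants for illustration only.
const uint32_t kMagicNumber = 0x12345678u;
const int kExampleInt = -123456;

// Writes the raw bytes of value.
template <typename T>
void WriteRaw(std::ostream& out, const T& value) {
  out.write(reinterpret_cast<const char*>(&value), sizeof(T));
}

// Reads sizeof(T) raw bytes; throws if the stream ends early.
template <typename T>
T ReadRaw(std::istream& in) {
  T value;
  in.read(reinterpret_cast<char*>(&value), sizeof(T));
  if (!in) throw std::runtime_error("Unexpected end of data.");
  return value;
}

// Writes: magic number, version, sizeof(int), example int value.
void WriteHeaderSketch(std::ostream& out, uint32_t version) {
  WriteRaw<uint32_t>(out, kMagicNumber);
  WriteRaw<uint32_t>(out, version);
  WriteRaw<uint32_t>(out, static_cast<uint32_t>(sizeof(int)));
  WriteRaw<int>(out, kExampleInt);
}

// Mirrors WriteHeaderSketch in the exact same order; returns the version.
uint32_t CheckHeaderSketch(std::istream& in) {
  if (ReadRaw<uint32_t>(in) != kMagicNumber) {
    throw std::runtime_error("Magic number mismatch.");
  }
  const uint32_t version = ReadRaw<uint32_t>(in);
  if (ReadRaw<uint32_t>(in) != sizeof(int)) {
    throw std::runtime_error("Size of int differs from the stored size.");
  }
  if (ReadRaw<int>(in) != kExampleInt) {
    throw std::runtime_error("Cannot read int values from this file.");
  }
  return version;
}
```

If any read were done out of order, the following fields would be misinterpreted, which is why every write needs its matching check at the same position.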
Portable Binary Data
Portable files are written/read with functions and classes provided in the header file promod3/core/portable_binary_serializer.hh.
Generally, we store any data-structure value-by-value as fixed-width types!
Writing and reading is performed by the following classes:
PortableBinaryDataSink to write files (opened as std::ofstream)
PortableBinaryDataSource to read files (opened as std::ifstream)
Each serializable class must define a Serialize function that accepts sinks and sources, such as:
template <typename DS>
void Serialize(DS& ds) {
  // serialize element-by-element
}
Or if this is not possible for an object of type T, we need to define global functions such as:
inline void Serialize(core::PortableBinaryDataSource& ds, T& t) { }
inline void Serialize(core::PortableBinaryDataSink& ds, T t) { }
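The benefit of a single templated Serialize function is that the same code describes both writing and reading. The following standalone sketch shows the pattern with toy classes. ToySink, ToySource and Point are hypothetical stand-ins, support only int32_t and do no error handling; they are not the ProMod3 classes.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <istream>
#include <ostream>
#include <sstream>

// Toy sink: writes int32_t values byte by byte, least significant first.
class ToySink {
public:
  explicit ToySink(std::ostream& stream): stream_(stream) { }
  bool IsSource() const { return false; }
  ToySink& operator&(int32_t& v) {
    uint32_t u;
    std::memcpy(&u, &v, sizeof(u));
    for (int k = 0; k < 4; ++k) {
      stream_.put(static_cast<char>((u >> (8 * k)) & 0xFFu));
    }
    return *this;
  }
  // any other type is expected to provide a Serialize member
  template <typename T>
  ToySink& operator&(T& obj) { obj.Serialize(*this); return *this; }
private:
  std::ostream& stream_;
};

// Toy source: reads back what ToySink has written.
class ToySource {
public:
  explicit ToySource(std::istream& stream): stream_(stream) { }
  bool IsSource() const { return true; }
  ToySource& operator&(int32_t& v) {
    uint32_t u = 0;
    for (int k = 0; k < 4; ++k) {
      const uint32_t byte = static_cast<unsigned char>(stream_.get());
      u |= byte << (8 * k);
    }
    std::memcpy(&v, &u, sizeof(v));
    return *this;
  }
  template <typename T>
  ToySource& operator&(T& obj) { obj.Serialize(*this); return *this; }
private:
  std::istream& stream_;
};

// One Serialize function serves both directions.
struct Point {
  int32_t x;
  int32_t y;
  template <typename DS>
  void Serialize(DS& ds) {
    ds & x;
    ds & y;
  }
};
```

Because the element order is written down only once, the save and load paths cannot drift apart.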
Given a sink or source object ds, we read/write an object v as:
ds & v, if v is an instance of a class, a bool or any fixed-width type (e.g. char, int32_t, float)
core::ConvertBaseType<T>(ds, v), where T is a fixed-width type. v will then be converted to/from T. This is needed for any non-fixed fundamental type (e.g. uint, short, Real).
Implementation notes:
the Serialize function for fundamental types takes care of endianness (all written as little endian and converted from/to native endianness)
custom Serialize functions exist for String (= std::string), std::vector<T> and std::pair<T,T2>. These will throw an error if the used type T or T2 is a fundamental type. In that case, you have to serialize the values manually and convert each element appropriately.
you can use ds.IsSource() to distinguish sources and sinks.
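As an illustration of the endianness note, the following standalone sketch converts a 32-bit value between native and little-endian byte order. The function names are hypothetical and the actual implementation in portable_binary_serializer.hh may work differently.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// True if the least significant byte is stored first in memory.
inline bool NativeIsLittleEndian() {
  const uint32_t probe = 1;
  unsigned char first_byte = 0;
  std::memcpy(&first_byte, &probe, 1);
  return first_byte == 1;
}

// Reverses the byte order of a 32-bit value.
inline uint32_t SwapBytes32(uint32_t v) {
  return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
         ((v & 0x00FF0000u) >> 8) | ((v & 0xFF000000u) >> 24);
}

// Returns v laid out in memory as little endian.
inline uint32_t NativeToLittle32(uint32_t v) {
  return NativeIsLittleEndian() ? v : SwapBytes32(v);
}

// Inverse of NativeToLittle32 (the swap is its own inverse).
inline uint32_t LittleToNative32(uint32_t v) {
  return NativeToLittle32(v);
}
```

On a little-endian machine both conversions are no-ops, so the cost of portability is only paid on big-endian hardware.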
Code Example
Here is an example of a class which provides functionality for portable and non-portable I/O:
// includes for this class
#include <boost/shared_ptr.hpp>
#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>

// includes for I/O
#include <promod3/core/message.hh>
#include <promod3/core/portable_binary_serializer.hh>
#include <promod3/core/check_io.hh>

using namespace promod3;

// define some data-structure
struct SomeData {
  short s;
  int i;
  Real r;

  // portable serialization
  // (cleanly element by element with fixed-width base-types)
  template <typename DS>
  void Serialize(DS& ds) {
    core::ConvertBaseType<int16_t>(ds, s);
    core::ConvertBaseType<int32_t>(ds, i);
    core::ConvertBaseType<float>(ds, r);
  }
};

// define pointer type
class MyClass;
typedef boost::shared_ptr<MyClass> MyClassPtr;

// define class
class MyClass {
public:
  MyClass(const String& id): id_(id) { }

  // raw binary save
  void Save(const String& filename) {
    // open file
    std::ofstream out_stream(filename.c_str(), std::ios::binary);
    if (!out_stream) {
      std::stringstream ss;
      ss << "The file '" << filename << "' cannot be opened.";
      throw promod3::Error(ss.str());
    }

    // header for consistency checks
    core::WriteMagicNumber(out_stream);
    core::WriteVersionNumber(out_stream, 1);
    // required base types: short, int, Real (for SomeData).
    //                      uint (for sizes)
    // required structs: SomeData
    core::WriteTypeSize<uint>(out_stream);
    core::WriteTypeSize<short>(out_stream);
    core::WriteTypeSize<int>(out_stream);
    core::WriteTypeSize<Real>(out_stream);
    core::WriteTypeSize<SomeData>(out_stream);
    // check values for base types
    core::WriteBaseType<uint>(out_stream);
    core::WriteBaseType<short>(out_stream);
    core::WriteBaseType<int>(out_stream);
    core::WriteBaseType<Real>(out_stream);

    // write string
    uint str_len = id_.length();
    out_stream.write(reinterpret_cast<char*>(&str_len), sizeof(uint));
    out_stream.write(id_.c_str(), str_len);
    // write vector of SomeData
    uint v_size = data_.size();
    out_stream.write(reinterpret_cast<char*>(&v_size), sizeof(uint));
    out_stream.write(reinterpret_cast<char*>(&data_[0]),
                     sizeof(SomeData)*v_size);
  }

  // raw binary load
  static MyClassPtr Load(const String& filename) {
    // open file
    std::ifstream in_stream(filename.c_str(), std::ios::binary);
    if (!in_stream) {
      std::stringstream ss;
      ss << "The file '" << filename << "' does not exist.";
      throw promod3::Error(ss.str());
    }

    // header for consistency checks
    core::CheckMagicNumber(in_stream);
    uint32_t version = core::GetVersionNumber(in_stream);
    if (version > 1) {
      std::stringstream ss;
      ss << "Unsupported file version '" << version
         << "' in '" << filename << "'.";
      throw promod3::Error(ss.str());
    }
    // check for exact sizes as used in Save
    core::CheckTypeSize<uint>(in_stream);
    core::CheckTypeSize<short>(in_stream);
    core::CheckTypeSize<int>(in_stream);
    core::CheckTypeSize<Real>(in_stream);
    core::CheckTypeSize<SomeData>(in_stream);
    // check values for base types used in Save
    core::CheckBaseType<uint>(in_stream);
    core::CheckBaseType<short>(in_stream);
    core::CheckBaseType<int>(in_stream);
    core::CheckBaseType<Real>(in_stream);

    // read string (needed for constructor)
    uint str_len;
    in_stream.read(reinterpret_cast<char*>(&str_len), sizeof(uint));
    std::vector<char> tmp_buf(str_len);
    in_stream.read(&tmp_buf[0], str_len);
    // construct
    MyClassPtr p(new MyClass(String(&tmp_buf[0], str_len)));
    // read vector of SomeData
    uint v_size;
    in_stream.read(reinterpret_cast<char*>(&v_size), sizeof(uint));
    p->data_.resize(v_size);
    in_stream.read(reinterpret_cast<char*>(&p->data_[0]),
                   sizeof(SomeData)*v_size);
    return p;
  }

  // portable binary save
  void SavePortable(const String& filename) {
    // open file
    std::ofstream out_stream_(filename.c_str(), std::ios::binary);
    if (!out_stream_) {
      std::stringstream ss;
      ss << "The file '" << filename << "' cannot be opened.";
      throw promod3::Error(ss.str());
    }
    core::PortableBinaryDataSink out_stream(out_stream_);

    // header for consistency checks
    core::WriteMagicNumber(out_stream);
    core::WriteVersionNumber(out_stream, 1);
    // required base types: short, int, Real
    // -> converted to int16_t, int32_t, float
    core::WriteTypeSize<int16_t>(out_stream);
    core::WriteTypeSize<int32_t>(out_stream);
    core::WriteTypeSize<float>(out_stream);
    // check values for base types
    core::WriteBaseType<int16_t>(out_stream);
    core::WriteBaseType<int32_t>(out_stream);
    core::WriteBaseType<float>(out_stream);

    // write string (provided in portable_binary_serializer.hh)
    out_stream & id_;
    // write vector (provided in portable_binary_serializer.hh)
    // -> only ok like this if vector of custom type
    // -> will call Serialize-function for each element
    out_stream & data_;
  }

  // portable binary load
  static MyClassPtr LoadPortable(const String& filename) {
    // open file
    std::ifstream in_stream_(filename.c_str(), std::ios::binary);
    if (!in_stream_) {
      std::stringstream ss;
      ss << "The file '" << filename << "' does not exist.";
      throw promod3::Error(ss.str());
    }
    core::PortableBinaryDataSource in_stream(in_stream_);

    // header for consistency checks
    core::CheckMagicNumber(in_stream);
    uint32_t version = core::GetVersionNumber(in_stream);
    if (version > 1) {
      std::stringstream ss;
      ss << "Unsupported file version '" << version
         << "' in '" << filename << "'.";
      throw promod3::Error(ss.str());
    }
    // check if required base types (see SavePortable)
    // are big enough
    core::CheckTypeSize<short>(in_stream, true);
    core::CheckTypeSize<int>(in_stream, true);
    core::CheckTypeSize<Real>(in_stream, true);
    // check values for base types used in SavePortable
    core::CheckBaseType<int16_t>(in_stream);
    core::CheckBaseType<int32_t>(in_stream);
    core::CheckBaseType<float>(in_stream);

    // read string (needed for constructor)
    String s;
    in_stream & s;
    // construct
    MyClassPtr p(new MyClass(s));
    // read vector of SomeData
    in_stream & p->data_;
    return p;
  }

private:
  std::vector<SomeData> data_;
  String id_;
};

int main() {
  // generate raw file
  MyClassPtr p(new MyClass("HELLO"));
  p->Save("test.dat");
  // load raw file
  p = MyClass::Load("test.dat");
  // generate portable file
  p->SavePortable("test.dat");
  // load portable file
  p = MyClass::LoadPortable("test.dat");
  return 0;
}
Existing Binary Files
The following binary files are currently in ProMod3:
module loop:
  frag_db.dat (FragDB)
  structure_db.dat (StructureDB)
  torsion_sampler_coil.dat (TorsionSampler)
  torsion_sampler.dat (TorsionSampler)
  torsion_sampler_extended.dat (TorsionSampler)
  torsion_sampler_helical.dat (TorsionSampler)
  ff_lookup_charmm.dat (ForcefieldLookup)
module scoring:
  cbeta_scorer.dat (CBetaScorer)
  cb_packing_scorer.dat (CBPackingScorer)
  hbond_scorer.dat (HBondScorer)
  reduced_scorer.dat (ReducedScorer)
  ss_agreement_scorer.dat (SSAgreementScorer)
  torsion_scorer.dat (TorsionScorer)
  aa_scorer.dat (AllAtomInteractionScorer)
  aa_packing_scorer.dat (AllAtomPackingScorer)
module sidechain:
  bb_dep_lib.dat (BBDepRotamerLib)
  lib.dat (RotamerLib)
During the make process, portable versions of the files (stored in the <MODULE>/data folder) are converted and corresponding raw binary files are stored in the stage/share/promod3/<MODULE>_data folder.
If the stage folder is moved after compilation (e.g. make install), the location of the share/promod3 folder must be stored in an environment variable called PROMOD3_SHARED_DATA_PATH. This variable is automatically set if you load any Python module from promod3, if you use the pm script or if you use a well-setup module on a cluster.
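For code that needs to locate a raw binary file itself, the lookup could be sketched as below. This is only an assumption about how the variable might be consumed, based on the folder layout described above. SharedDataDir and RawDataFile are hypothetical helpers, not ProMod3 functions.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: returns the content of the environment variable
// var_name, or fallback if the variable is unset or empty.
inline std::string SharedDataDir(const char* var_name,
                                 const std::string& fallback) {
  assert(var_name != nullptr);
  const char* value = std::getenv(var_name);
  if (value != nullptr && value[0] != '\0') return std::string(value);
  return fallback;
}

// Builds "<dir>/<module>_data/<file_name>", following the
// share/promod3/<MODULE>_data layout described above.
inline std::string RawDataFile(const std::string& dir,
                               const std::string& module,
                               const std::string& file_name) {
  return dir + "/" + module + "_data/" + file_name;
}
```

A call such as RawDataFile(SharedDataDir("PROMOD3_SHARED_DATA_PATH", "stage/share/promod3"), "loop", "frag_db.dat") would then resolve the raw fragment database, with the stage folder used as an assumed fallback.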
Code for the generation of the binary files and their portable versions is in the extras/data_generation folder (provided as-is).