While using a std::map
is fine or using a 256-sized char table would be fine, you could save yourself an enormous amount of space agony by simply using an enum
. If you have C++11 features, you can use enum class
for strong-typing:
// First, we define base-pairs. Because regular enums
// Pollute the global namespace, I'm using "enum class".
enum class BasePair {
A,
T,
C,
G
};
// Let's cut out the nonsense and make this easy:
// A is 0, T is 1, C is 2, G is 3.
// These are indices into our table
// Now, everything can be so much easier
BasePair Complimentary[4] = {
T, // Compliment of A
A, // Compliment of T
G, // Compliment of C
C, // Compliment of G
};
Usage becomes simple:
int main (int argc, char* argv[] ) {
BasePair bp = BasePair::A;
BasePair complimentbp = Complimentary[(int)bp];
}
If this is too much for you, you can define some helpers to get human-readable ASCII characters and also to get the base pair compliment so you're not doing (int)
casts all the time:
BasePair Compliment ( BasePair bp ) {
return Complimentary[(int)bp]; // Move the pain here
}
// Define a conversion table somewhere in your program
char BasePairToChar[4] = { 'A', 'T', 'C', 'G' };
char ToCharacter ( BasePair bp ) {
return BasePairToChar[ (int)bp ];
}
It's clean, it's simple, and its efficient.
Now, suddenly, you don't have a 256 byte table. You're also not storing characters (1 byte each), and thus if you're writing this to a file, you can write 2 bits per Base pair instead of 1 byte (8 bits) per base pair. I had to work with Bioinformatics Files that stored data as 1 character each. The benefit is it was human-readable. The con is that what should have been a 250 MB file ended up taking 1 GB of space. Movement and storage and usage was a nightmare. Of coursse, 250 MB is being generous when accounting for even Worm DNA. No human is going to read through 1 GB worth of base pairs anyhow.