taw's blog: Diversity of genetic code

Tuesday, January 26, 2010

Diversity of genetic code

I love you otter by splityarn from flickr (CC-NC-SA)

We've all seen the "standard" genetic code - a big table showing which of the 64 triplets of the 4 RNA bases codes for which of the 20 amino acids, or for a stop - it's in every single biology textbook these days. What's rarely mentioned more than in passing is how diverse the genetic code really is.

Here's the table I made based on data from NCBI (color-coding entirely random, in case you're wondering; first base left, second base top as usual):

For something supposedly universal, there's a surprising amount of variety - a quarter of positions, or 12 of 48, vary. Wait, have I just said 48?

I'll explain. The genetic code is based more on 2.5 nucleotides than on 3. Depending on the organism, in somewhere between 7 and 9 of the 16 cases the first two bases fully determine the result and the third base doesn't matter at all.

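No organism is able to tell U from C in the third position; even attempts to make them do so with genetic engineering have failed so far. So at most there are three possibilities for the third base - UC vs A vs G - and 16 combinations of the first two bases times 3 gives the 48 positions.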
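Here's a minimal Python sketch of that counting, using the standard code only. The 64-letter string is the standard code as NCBI lists it (translation table 1, `*` meaning stop); the variable names are just mine for illustration.

```python
from itertools import product

BASES = "UCAG"  # the usual table order: first base, second base, third base
# Standard code (NCBI translation table 1), one letter per codon, '*' = stop.
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

codons = ["".join(c) for c in product(BASES, repeat=3)]
code = dict(zip(codons, STANDARD))

# In the standard code, U and C in the third position always mean the same thing.
prefixes = ["".join(p) for p in product(BASES, repeat=2)]
uc_equivalent = all(code[p + "U"] == code[p + "C"] for p in prefixes)

# Collapse the third base into three classes: U/C together, A, G.
positions = {(c[:2], "UC" if c[2] in "UC" else c[2]) for c in codons}

print(uc_equivalent)   # True
print(len(positions))  # 48
```

This only checks the standard code, of course; the claim in the text is about all known codes.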

Of these, the UC vs AG distinction is very common. Most organisms (not even all) can occasionally distinguish A from G in the third position, but they rarely bother, and these distinctions are particularly unstable - not a single one of them is universally followed. When A is distinguished from G, it's usually as part of a UCA vs G pattern; full three-way discrimination is rare, and UCG vs A even more so.
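To get a feel for these patterns, here's a rough sketch that classifies each of the 16 first-two-base combinations in the standard code by how it splits the third base. Again this assumes NCBI's translation table 1 string; `third_base_pattern` is a made-up helper, not anything from a library.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard code (NCBI translation table 1), one letter per codon, '*' = stop.
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

codons = ["".join(c) for c in product(BASES, repeat=3)]
code = dict(zip(codons, STANDARD))


def third_base_pattern(code, prefix):
    """Group third bases by the result they give, e.g. 'UC|AG' or 'UCA|G'."""
    groups = {}
    for base in BASES:
        groups.setdefault(code[prefix + base], []).append(base)
    return "|".join("".join(g) for g in groups.values())


patterns = Counter(
    third_base_pattern(code, "".join(p)) for p in product(BASES, repeat=2)
)

for pat in ("UCAG", "UC|AG", "UCA|G", "UC|A|G", "UCG|A"):
    print(pat, patterns[pat])
# UCAG 8     third base doesn't matter
# UC|AG 6
# UCA|G 1    AUN: Ile vs Met
# UC|A|G 1   UGN: Cys vs stop vs Trp
# UCG|A 0
```

So the standard code sits right in the middle of the 7-9 range, with 8 of 16 cases fully determined by the first two bases.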

Some observations:

All codes move 3 nucleotides at a time, and output 1 amino acid. Except even that isn't universally true - it's not shown in this table, but ribosomes have the ability to take 4 nucleotides instead of 3 in some cases, and in some organisms this is used fairly often, in as many as 10% of all genes. A really good article on origins of life explains how a 3-nucleotide code could have plausibly evolved from the RNA world.
All codes have codons for each of the 20 standard amino acids and a stop codon. None seems to have given up on any amino acid, even in mitochondria. Something that's not shown here is that a few organisms also code for two extra amino acids - selenocysteine (UGA) and pyrrolysine (UAG).
While all codes have one or more stop codons, none of the stop codons is universal, and they're frequently reused for other purposes.
Start codons are so messy and so context-dependent that I didn't even bother including them in the table. The official story that AUG=Start is not terribly accurate even in humans.
Most of the variety comes from mitochondria - they are in the highly peculiar situation of having their own translation mechanism, but very small genomes. A change of code won't affect that many genes, so it's not as likely to cause instant death.
Other organisms with tiny genomes - small viruses - use the translation system of their hosts, so they must obey their hosts' codes. Some viruses like Mimivirus have some genes related to the translation mechanism, but such viruses also have huge genomes, so they cannot easily change their code.
Some variety comes from normal free-living organisms - but these changes tend to be more minor, usually one of the stop codons being reused to code for some amino acid - one of the 20, or as mentioned before selenocysteine or pyrrolysine.
In all likelihood the table will only become messier as we research more organisms. And we can manipulate the code to make organisms incorporate some even weirder amino acids, so if we included both natural and engineered codes, it would be messier than the source code of X11.
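As a concrete taste of the mitochondrial variety, here's a small sketch diffing the standard code against the vertebrate mitochondrial one. Both 64-letter strings are what NCBI lists as translation tables 1 and 2, to the best of my knowledge; the names `STANDARD`, `VERT_MITO` and `diff` are just mine.

```python
from itertools import product

BASES = "UCAG"
# One letter per codon in UCAG order, '*' = stop.
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"   # NCBI table 1
VERT_MITO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"  # NCBI table 2

codons = ["".join(c) for c in product(BASES, repeat=3)]

# Codons whose meaning differs between the two codes.
diff = {
    codon: (std, mito)
    for codon, std, mito in zip(codons, STANDARD, VERT_MITO)
    if std != mito
}

for codon, (std, mito) in diff.items():
    print(codon, std, "->", mito)
# UGA * -> W
# AUA I -> M
# AGA R -> *
# AGG R -> *

# Both codes still cover all 20 amino acids plus stop.
print(len(set(STANDARD)), len(set(VERT_MITO)))  # 21 21
```

Just four codons differ, but that already includes a stop codon turned into an amino acid and two arginine codons turned into stops.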