In yesterday’s post about biomining, I mentioned I was involved in a second type of mining and that’s the subject of today’s post: genome mining.
Last week I submitted a manuscript to a journal, for consideration for publication, on the genome sequence of a strain of Methylomonas methanica. The manuscript has 27 authors based at 8 institutions in 5 different countries, which makes it the biggest project, in terms of number of authors, that I have been involved in. Of those 27 authors, I’m at the front because I pulled most of it together in the end. I started out at about 5th and gradually moved forwards, particularly as I ended up writing the manuscript itself - that wasn’t planned originally but that’s how it worked out. This is the second genome sequence that I have worked on - the first was that of Methylophaga thiooxydans, which I published earlier this year with a much smaller team.
Both of these organisms are Bacteria found in the marine environment and, since they both have “Methylo-” generic names, they evidently have a lot in common. Members of the genus Methylomonas are methanotrophic, which means that they can grow on methane as a source of carbon. There is an apparent peculiarity amongst a lot of Bacteria that can grow on methane in that they seem to depend upon it - they’ve lost the ability to grow on pretty much anything else. Personally, I don’t think this makes evolutionary sense and I’m sure they can use other things but refuse to do so in the lab. Members of the genus Methylophaga are still fussy eaters but far less so - they can’t grow on methane but they can grow on methanol, methylated amines, dimethylsulfide and a lot of other “one-carbon” or “C1” compounds, in addition to fructose and a few other multicarbon compounds. That makes them methylotrophic. These C1 compounds are prevalent in seawater (yes, even methane) and it makes sense, therefore, that there is something out there eating them. After decades of C1 research around the world, we now know which enzymes and genes are responsible for the metabolism of a lot of these compounds but we’re still struggling with some of the details - which is why the genome sequences are so useful.
Genome sequencing has moved FAST. When the Methylophaga thiooxydans sequence was done, it took almost a year and cost A Lot Of Money, and it’s still not finished - I think about 10% of the sequence is missing because we didn’t pay the extra money needed to close the sequence (genomes from the Bacteria are circular, so finishing them off requires closing the circle). When Methylomonas methanica was done, it took weeks, not months, and cost far, far less than the M. thiooxydans one! The DNA is physically chopped into pieces, each of these pieces is sequenced and the pieces are then reassembled with a computer. Once all of the DNA sequence has been obtained, checked for errors and confirmed to have been assembled properly, the next stage is a bit like a word-search: Finding The Genes. In bacterial genetics, almost all genes start and end in very particular ways and we use powerful computers to sift through the sequences to find all of these and mark them off. To put this into perspective, the genome I have just been working on had just over 5,000,000 bases (letters C, G, A or T) and about 5,000 predicted genes. If each gene has a set start and stop, that’s 10,000 starts and stops, each of which is 3 bases long, so that’s 30,000 bases to look for out of 5,000,000, or about 0.6% of the sequence - it’s somewhere between a word-search and finding a needle in a haystack.
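To give a feel for the word-search part, here is a minimal Python sketch of scanning a sequence for a start followed, in the same reading frame, by a stop. It is only an illustration under some loud assumptions: it uses the standard start codon ATG and stop codons TAA, TAG and TGA, looks at one strand only, and the function name `find_orfs` and the toy sequence are made up for this post. Real gene-finding software does far more than this (both strands, alternative starts, statistical scoring), so this is not how our genome was actually annotated.

```python
# Toy word-search for genes: an ATG start followed in-frame by a stop codon.
# Assumptions: forward strand only, standard start/stop codons, no scoring.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return sorted (start, end) index pairs of simple open reading frames.

    `end` is exclusive and includes the stop codon. `min_codons` is the
    minimum number of codons from the start up to (not including) the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):  # the three forward reading frames
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i  # remember the first start in this stretch
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next start after this stop
    return sorted(orfs)


# Hypothetical toy sequence, invented for this example.
toy = "CCATGAAACCTGGCTAAGG"
orfs = find_orfs(toy)

# The needle-in-a-haystack arithmetic from the paragraph above.
signal_bases = 2 * 5_000 * 3            # starts + stops, 3 bases each
signal_fraction = signal_bases / 5_000_000
```

On the toy sequence this picks out the single stretch running from the ATG to the TAA, and the arithmetic at the bottom reproduces the 30,000 bases out of 5,000,000.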
Once all of these genes have been predicted, a computer looks at each one and compares the sequence of letters to known genes in public databases - for example, in an imaginary genome, if we had the sequence:
GAA TCC GCC AAA AAA CCT CCT GGC
a computer could find a 100% identical match in a known genome that was for a gene that had been demonstrated experimentally to encode a stress protein - that’s easy - we can confidently say our gene encodes the same stress protein…but this doesn’t happen often. Often, we’d find the nearest thing with any resemblance was:
GAA GGG GGG AAA AAA CCT CCT GGC
That’s 5 differences out of 24 positions - 79% identity. We would then have to say that our gene might encode a stress protein. Now, here’s another potential match:
GAA TCC GCC TTT TTT CCT CCT GGC
6 differences out of 24 positions this time - 75% identity, which isn’t bad. But what if the gene in the database was for a stress protein and it had been shown experimentally that the pair of amino acids encoded by the sequence “TTT TTT” was absolutely essential for it to be functional? Our gene has 75% identity but doesn’t have this motif. Therefore, we can’t really say it’s a stress protein - it doesn’t have the key bit of the sequence, even though the rest is all correct.
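The counting in these examples is easy to reproduce. The Python sketch below compares the three sequences from this post position by position and then checks for the essential motif. Two assumptions to flag: the function names `percent_identity` and `has_motif_at` are mine, and a straight position-by-position comparison is a simplification, since real database searches align the sequences first and allow for gaps.

```python
def percent_identity(a, b):
    """Percentage of positions at which two equal-length sequences match.

    Spaces are ignored so the codon-spaced sequences can be pasted in as-is.
    This is an ungapped comparison, not a proper sequence alignment.
    """
    a = a.replace(" ", "").upper()
    b = b.replace(" ", "").upper()
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def has_motif_at(seq, motif, position):
    """True if `motif` occurs in `seq` starting at the 0-based `position`."""
    seq = seq.replace(" ", "").upper()
    motif = motif.replace(" ", "").upper()
    return seq[position:position + len(motif)] == motif


ours = "GAA TCC GCC AAA AAA CCT CCT GGC"   # our imaginary gene
hit1 = "GAA GGG GGG AAA AAA CCT CCT GGC"   # first database match
hit2 = "GAA TCC GCC TTT TTT CCT CCT GGC"   # second database match

identity1 = percent_identity(ours, hit1)   # 19 of 24 positions match
identity2 = percent_identity(ours, hit2)   # 18 of 24 positions match

# The essential "TTT TTT" motif sits at bases 10-15 (0-based position 9).
motif_in_hit2 = has_motif_at(hit2, "TTT TTT", 9)
motif_in_ours = has_motif_at(ours, "TTT TTT", 9)
```

The point of the second check is that the identity figure alone would let our gene through, while the motif check is what flags it as doubtful.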
Scenarios like these happen all the time when a computer is annotating a genome by comparing each gene to a public database and so, when the genome sequence has been completely annotated, that’s where genome mining comes in - to check it’s all ok! If I were interested in genes encoding a particular protein, the first thing I would do is search the database of our new genome with a text string such as “methanol dehydrogenase”. Found it? No - then I’d try other names I know it could be called, such as “PQQ-dependent dehydrogenase” or “alcohol dehydrogenase”. Anything? Still nothing - now I go and find a methanol dehydrogenase gene sequence from another bug and search through the database for it. Got it! It’s hiding under “dehydrogenase” - it’s the right size, it has all the correct motifs and 98% identity to a known one - that seems ok! Now on to the next gene…and so on…
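The text-search step can be sketched in a few lines of Python. Everything here is hypothetical: the gene identifiers, the tiny `annotations` table and the function name `search_annotations` are invented for illustration and do not come from our genome database. The idea is simply to try each name in turn and to fall back on a sequence search when none of them turns anything up.

```python
# Hypothetical annotation table: gene identifier -> predicted product name.
annotations = {
    "gene_0001": "dehydrogenase",
    "gene_0002": "stress protein",
    "gene_0003": "hypothetical protein",
}


def search_annotations(annotations, names):
    """Try each name in order; return (name, gene ids) for the first with hits.

    Matching is a case-insensitive substring search on the product name.
    Returns (None, []) when no name matches, which is the cue to search
    by sequence instead.
    """
    for name in names:
        hits = [gene_id for gene_id, product in annotations.items()
                if name.lower() in product.lower()]
        if hits:
            return name, hits
    return None, []


names_to_try = [
    "methanol dehydrogenase",
    "PQQ-dependent dehydrogenase",
    "alcohol dehydrogenase",
]
matched_name, matched_genes = search_annotations(annotations, names_to_try)

# The gene is there all along, just filed under a vaguer name.
vague_name, vague_genes = search_annotations(annotations, ["dehydrogenase"])
```

In this toy table all three specific names come back empty, and only the vague “dehydrogenase” label finds the gene, which is exactly the situation where a sequence search earns its keep.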
It takes quite a long time to mine each gene and to check all of the important motifs are present along with the other required stuff. Even though the computers do the initial part of the annotation of the genome without human help, there is no substitute for a scientist’s eyes and brain when it comes to mining the genome properly and making sure it’s correct. A computer can only do a job as good as the programme it is running and, even with a perfect programme, it wouldn’t know the science - it wouldn’t know, for example, that methanol dehydrogenase has two subunits and some accessory proteins and needs PQQ as a cofactor - so the genes to synthesise that have to be in the genome too for it to make sense.
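That kind of “does it make sense as a whole?” check is really a checklist, and a checklist is easy to write down even if knowing what belongs on it is not. Here is a small Python sketch, with the caveat that the component labels and the function name `missing_components` are placeholders of my own, based only on the description above, rather than real gene names.

```python
# Placeholder checklist for a working methanol dehydrogenase, following the
# description above: two subunits, accessory proteins and PQQ synthesis.
REQUIRED = {"subunit A", "subunit B", "accessory proteins", "PQQ synthesis"}


def missing_components(found, required=REQUIRED):
    """Return the required components not present in `found`, sorted by name."""
    return sorted(set(required) - set(found))


# Hypothetical result of mining a genome: the PQQ synthesis genes are absent.
found_in_genome = {"subunit A", "subunit B", "accessory proteins"}
missing = missing_components(found_in_genome)
```

The computer can tick the boxes, but it took a scientist to know which boxes to draw in the first place.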
We rely on technology - and need it for many parts of genome sequencing - but we still need to sift through the sequences by hand, checking it’s all been put together properly and that it makes scientific sense. The time has not yet come when a computer can check for scientific sense, since a computer can’t have ideas or go off and read a paper and understand why some genes are the way they are. I hope we never manage to build a computer that powerful - none of us humans would be needed ever again.