And we are back on the slow crawl toward eventually explaining what I do, out here in the darker recesses of my lab tucked in the remote Kansai countryside.
Aside from breeding deadly mutant monkeys to serve in my army of evil minions when I kickstart the world-domination part of my plot, that is.
Before I go any further, let me remind the casual reader that: 1) it is most likely nice and sunny out there where you live and you would be considerably better off looking at squirrels running through the trees 2) if you have even the slightest inkling of formal mathematical/computer science training, you will be better served foregoing this edulcorated version in favour of one of the 10 million tutorials and entries on bioinformatics available throughout the internets (Wikipedia being a good place to start). The entry written henceforth is geared at some hypothetical grandparents who would care to know what the fuss with modern Science is all about (for instance mine, were they not already perfectly content in the sole knowledge that the good Lord has put all these tiny amino-acids together in the best possible way of all worlds and that modern genetics is the work of the Devil1I know: one is not supposed to capitalise the name of God’s evil nemesis, but I am going on the assumption that Satan is a vindictive bastard and one can never be too prudent in courting the good graces of major players of the afterworld.).
In last month’s episode, we laboriously learnt that Biology abounds with really, really, tough problems. Two major points were:
1. For all practical matters, NP-Complete problems are all in the same bag: finding a way to solve one efficiently would mean you can solve any other in roughly the same order of time.
2. Once you have proved that a problem is NP-Complete, trying to find an exact solution for a real-life set of data, is about as meaningful as trying to take down the Everest with a toothpick. There are however plenty of ways to find an approximate solution. Proving NP-Completeness is your cue to start looking for approximation algorithms; and thus the fun begins.
Today, instead of going straight onto the myriad fun ways in which mathematicians solve biology problems, and which one of those I am actually connected to, another digression and an illustration everyone has heard of: genome sequencing.
Full genome sequencing (mapping the entire DNA of a given organism) is one of the earliest application of modern bioinformatics techniques, a seminal example: it starts off as a rather straightforward bio-chemistry problem, soon runs into pesky matters of size, complexity and intractability, goes through a difficult phase of alcohol and substance abuse, but is ultimately saved by the power of Love and Mathematics.
Before I go into the gory details, allow me to dissipate a common misconception about DNA sequencing: it is nowhere as easy as you might have been led to believe by your TV (most people’s preferred source of Science™ facts). Hearing of “DNA tests”, “DNA crime database” and other everyday life DNA-related techniques might make it sound like sequencing is as easy as sending your saliva swab to the lab and waiting a couple days for the results. In reality, despite serious advances, actual full genome sequencing is still a multi-year, multi-million-dollar affair. When people talk about DNA in a forensics or medical context, they are usually looking at a single base nucleotide, located at a precise location on one gene, out of the entire genome. Even cases that require a larger sample of such observations (e.g. DNA matching, when it actually uses sequencing altogether) are still somewhere in the lower hundreds (if that). That’s a mere 100 bases to look at, against 100+ million for the first organism fully sequenced, 10 years ago (make that 3 billions for humans). Quite a difference in scale. And, of course, this is one of those problems where solving twice the size requires much more than twice the time (hopefully by now, this does not surprise you, otherwise you might want to go back and read episode 1 again).
OK, let’s start: