What is Dave doing in Bioinformatics? Pt. 1

Two months and countless draft embryos after initially promising it, here is the first part of an unfathomably long rant describing my field of research. I honestly don’t expect anybody to subject themselves to that read, but at least now I have a place to send those who foolishly ask me about it at cocktail parties.

The short answer is that I do research in Bioinformatics, which is where Mathematics (along with Computer Science and a dozen other disciplines) meet with Biology and Genetics in a dark back-alley, and do all sorts of indescribable things to each other in the hope of: creating a better world, curing cancer, breeding the next race of eugenic übermenschen or making a few bucks for Big Pharma… whichever comes first.

But that sort of answer, while technically correct, does not really tell you why such an unnatural coupling of disciplines was warranted in the first place. Allow me to start at the beginning. Way at the beginning.

[open long semi-relevant digression that can be advantageously replaced by a thorough read on Complexity Theory, if you feel up for the more sciencey and truthy version of things]

The scientific problems of this world tend to fall in either of two categories: those you might eventually solve with a good computer and some time… and those you will never solve exactly, no matter how much crazy sci-fi supercomputing power you throw at them.

This “solvable” vs. “not solvable” demarcation might sound like a tautology, until you understand the full meaning of “never” in the above statement: these, are not problems that might be solved one day, when science progresses far enough or computers get ten, a hundred or a million times faster. These are problems whose solutions require calculation of a complexity that is proven to be beyond the reach of any conventional means of computation in any foreseeable future (“unconventional means” would begin with the discovery of heretofore unknown laws of Physics: in other words, unlikely in your lifetime. at best¹).

By and large, the mathematical complexity of a problem, is the order of time (or computing power) it will take to solve it, relative to its size.

Without calculating the result of a certain task, it is often possible to predict whether producing this result could or could not be done in a reasonable amount of time (where “reasonable” usually means “in less than the age of the universe, assuming the use of every single computer on earth”, or somesuch).

There are countless examples of tasks falling in the first category, “easy” tasks that can be solved quickly, regardless of how big they are. For example, anybody past kindergarten age can presumably add two numbers of practically any size with a piece of paper and a pen. You just add each digit one by one (and, yes, carry the one) and adding two 100-digit numbers will take barely more time than adding two 3-digit numbers.

Now consider a different task: say you are a traveling salesman who needs to plan their next sales route. You have a map of the region, with the towns you must visit and all the distances between them, given in kilometers. How do you find the absolute shortest route that will take you to each city at least once without wasting gas or time?

More to the point: how difficult do you think finding that route will be?

Sure, it sounds easy enough: pick a starting point, follow every roads that go from that city to another one, then onto the next etc. Keep the shortest distance you’ve found. Can’t be that tough, right?

Let’s say there are five cities: you pick a city to start from, then check all remaining four, and from each four, go onto one of the remaining three etc. etc. In total, that’s 5x4x3x2x1 = 120 different paths to compare (that product can also be written using the factorial function: n! = n x (n-1) x … x 3 x 2 x 1. e.g. 5! = 5x4x3x2x1). Not so bad.

What if there are a few more cities… for instance, two times more: 10 cities. That’s 10! = 10x9x8x7x6x5x4x3x2x1 = 3,628,800 paths to look at. Huh, that might take a bit longer to do by hand. No worries: somebody will write a computer program that gives you the answer in a couple seconds.

Except that, you guessed it, each time I double the number of cities, the difficulty does way more than just double.

For 20 cities, the number of paths to look at is: 20! = 2,432,902,008,176,640,000.

For 70, cities, there are 70! (that’s factorial of 70: 70x69x68x…x3x2x1) possible paths to check one by one. That number has exactly 100 digits. This is (very) roughly the number of particles in the entire universe. Assuming you were to put every single computer in the world to work on this, you likely would not be done by the time the Sun explodes.

Of course, you might suspect there are better ways to do this than looking at every possible path one by one. There are. But they are neither easy, nor significantly better. In the end, no matter what method you come up with, it has been proven that it will always require in the order of 2ⁿ operations (where n is the total number of cities). Although it is smaller than n!, 2ⁿ is still a ludicrously huge number, for a large-enough number of cities (“n”). In practice, this puts the Traveling Salesman Problem in the “unsolvable” category: by putting enough cities on a map, one can always ensure that no amount of raw computing power and mathematical genius will ever solve it in a reasonable amount of time.

Another common example is the Knapsack Problem: given a knapsack that can only be filled up to a certain maximal weight, and given a set of items, each with a specific weight and value, how difficult is it to pick the combination of items that will have the highest value, while remaining under the total weight limit. To the despair of jewelry thieves the world over, this problem is equally impossible to solve easily: as soon as the number of items to choose from gets big enough, the number of possible arrangements to look at shoots through the roof (and it is very hard to carry your own supercomputer while robbing a jewelry store).

These two problems (and many others) belong to a class of problems known as NP-Complete. The only two things you might ever want to know about NP-Completeness and NP-Complete problems (beside the fact that unicorns and kittens are considerably more fun to look at) are that:

NP-Complete problems are not just problems that “seem” tough until somebody figures a better way to do them. They are tough. Mofo tough. And provably so. No matter how fast computers keep getting (assuming they keep improving at that rate), they will always fall short of solving this type of problem, for big enough data.
All NP-Complete problems are interconnected. In fact, this is how you usually prove that a problem is NP-Complete: by proving that solving it is as difficult as solving another NP-Complete problem. If by a mathematically implausible stroke of genius you found an easy way to solve one NP-Complete problem, you would have found a way to solve them all.

Most importantly: NP-Complete problems are everywhere.

If the plight of traveling salesman and jewelry robbers throughout the world does not seem to justify wasting your brain cells, keep in mind that their family-friendly formulation is a mere varnish on top of some very ugly raw math definitions, that make them the standard building blocks of countless science and everyday-life problem modeling.

[end digression]

Which finally brings us to Bioinformatics. By way of Biology: Biology is full of hard problems. And by hard, I do not mean “take a long time, need a big microscope and a big computer”, I mean problems that somehow tie to NP-Complete problems. You could say that a lot of things in the real world, end up toward NP-Complete problems, but there are two particularities that make biology/genetics problems more likely to:

1. Biological things love exponential growth. Pretty much every Living Thing in Nature grows exponentially. This tends to greatly affect problems related to the science of Living Things, otherwise known as Biology.

2. In order to be useful, biological problems have to be solved on humongous data sets: whether it is sequences of billions of amino-acids, graphs with thousands of vertices and edges or experimental results involving thousands of genes at once…

If you read the digression above and throw in these two points, you might start seeing why biologists have started running into lots of walls over the past 20-30 years… The universal approach to solving Life Science problems (buying bigger computers or waiting a tad longer for them to pop out the results) no longer works. Identifying protein similarities in a DNA sequence database is not just a needle-and-haystack problem, it is is a needle with a near-infinitely large haystack problem.

Of course, I did omit a major detail in my coarse overview of NP-Completeness above: provided you are fine with an approximate solution, most NP-Problems can be solved very nicely. However, the mathematical tools to solve problems approximately are often very different from the more intuitive (but impractical) techniques that lead to exact solutions. At the sight of the barbaric math involved, most traditional wet lab biologists will usually recoil in horror and tell you to call them when the algorithm that does it is done developing and implementing… Quite understandably, biologists are most happier handling their test tubes, cutting open innocent mice and manning their cool expensive equipment to produce ton after ton of raw data: they didn’t sign up for the boring math and computer stuff that is increasingly necessary to interpret or use these data.

And this, ladies and gentlemen, is how a guy with a background in Math, Computer Science and Artificial Intelligence, who has not touched a microscope since high school, ends up working on protein sequences and drug interaction prediction.

Next week(*), I might even tell you how…

(*) margin of error: +/- 3 weeks.

7 comments

Julian says:

Mon - 10.19.2009 at 08:58 pm

Solving TSP in 2n? Please check your superscripts!
Dave says:

Mon - 10.19.2009 at 09:37 pm

Oops… I need to get it through my head that there is no <exp> HTML tag. Fixed: thanks!

That being said: what business do you have reading that CS 101 drivel 😛
j-ster says:

Fri - 10.23.2009 at 02:03 pm

Is Wikipedia’s list of NP-Complete problems complete?

I remember asking you once about what you were working on, and whilst your explanation was understandable (thank you) and extremely interesting, I had the strong suspicion that every time you answered that question, you lost a bit of your life force energy, and so I never asked it again. None of the above sounds anything like what you described to me back then, but it’s still interesting.
Dr.snail says:

Thu - 10.29.2009 at 04:31 am

Peut-on considerer que
Entre l’infiniment Grand et l’infiniment Petit
il y a beaucoup de NPcomplete problems ?
Dave says:

Fri - 10.30.2009 at 12:15 pm

j-ster: that list is not complete, no… 😉 if only because there is an infinite number of NP-complete problems: each one being merely a variation (“reduction” in technical terms) to another one, you could theoretically create a new NP-complete problem from an existing one by tweaking a few things.

The discrepancy between the description above and what I may have told you in the past, stems from the fact we probably had that conversation more than 2 years ago, when my field of study was entirely different, with practical applications that veered more toward Natural Language Processing (“computational linguistics”)…
Pingback: Dave's Blog » Blog Archive » Where is my Olympic Discipline?
Pingback: Dave's Blog » Blog Archive » What is Dave doing in Bioinformatics? Pt. 2

Comments are closed.

Dave’s Logs

Life does not cease to be funny when people die any more than it ceases to be serious when people laugh.

What is Dave doing in Bioinformatics? Pt. 1

7 comments