Freshly back from two weeks in Kyoto where I attended a Machine Learning Summer School organised at Kyoto University. The whole event was impeccably organised and featured some of the biggest names in the field lecturing on fundamental aspects of machine learning: two highly productive weeks leaving barely enough time in the evening to catch up with friends and enjoy life in my former hometown…

A few random thoughts and observations:

Try as I may, I keep forgetting the kanji for 遺伝子 (いでんし: “gene, genetics”)…

On the other hand I have absolutely no trouble remembering words like 居合い (いあい: “act of drawing one’s katana, killing with a single stroke, shaking out the blood and sheathing the katana back, as one single movement” — yes, it means all that; no, I am not kidding) after seeing them used exactly once.

Which would be great if, you know, there was more assisting with friends’ seppuku and less messing around with genes, in my life right now.

Can we all agree already to skip the “Why research on cancer is useful” introduction slide from now on?

If your talk lasts 25 minutes and goes into the minutiae of protein-protein interactions with regard to oncogenic pathways, maybe spending half of it convincing an audience of biologists and bioinformaticians that cancer is a bad thing that needs curing is not the best use of presentation time.

And we are back on the slow crawl toward eventually explaining what I do, out here in the darker recesses of my lab tucked in the remote Kansai countryside.

Aside from breeding deadly mutant monkeys to serve in my army of evil minions when I kickstart the world-domination part of my plot, that is.

Before I go any further, let me remind the casual reader that: 1) it is most likely nice and sunny out there where you live and you would be considerably better off looking at squirrels running through the trees 2) if you have even the slightest inkling of formal mathematical/computer science training, you will be better served foregoing this edulcorated version in favour of one of the 10 million tutorials and entries on bioinformatics available throughout the internets (Wikipedia being a good place to start). The entry written henceforth is geared at some hypothetical grandparents who would care to know what the fuss with modern Science is all about (for instance mine, were they not already perfectly content in the sole knowledge that the good Lord has put all these tiny amino-acids together in the best possible way of all worlds and that modern genetics is the work of the Devil1).

In last month’s episode, we laboriously learnt that Biology abounds with really, really, tough problems. Two major points were:

1. For all practical matters, NP-Complete problems are all in the same bag: finding a way to solve one efficiently would mean you can solve any other in roughly the same order of time.

2. Once you have proved that a problem is NP-Complete, trying to find an exact solution for a real-life set of data, is about as meaningful as trying to take down the Everest with a toothpick. There are however plenty of ways to find an approximate solution. Proving NP-Completeness is your cue to start looking for approximation algorithms; and thus the fun begins.

Today, instead of going straight onto the myriad fun ways in which mathematicians solve biology problems, and which one of those I am actually connected to, another digression and an illustration everyone has heard of: genome sequencing.

Full genome sequencing (mapping the entire DNA of a given organism) is one of the earliest application of modern bioinformatics techniques, a seminal example: it starts off as a rather straightforward bio-chemistry problem, soon runs into pesky matters of size, complexity and intractability, goes through a difficult phase of alcohol and substance abuse, but is ultimately saved by the power of Love and Mathematics.

Before I go into the gory details, allow me to dissipate a common misconception about DNA sequencing: it is nowhere as easy as you might have been led to believe by your TV (most people’s preferred source of Science™ facts). Hearing of “DNA tests”, “DNA crime database” and other everyday life DNA-related techniques might make it sound like sequencing is as easy as sending your saliva swab to the lab and waiting a couple days for the results. In reality, despite serious advances, actual full genome sequencing is still a multi-year, multi-million-dollar affair. When people talk about DNA in a forensics or medical context, they are usually looking at a single base nucleotide, located at a precise location on one gene, out of the entire genome. Even cases that require a larger sample of such observations (e.g. DNA matching, when it actually uses sequencing altogether) are still somewhere in the lower hundreds (if that). That’s a mere 100 bases to look at, against 100+ million for the first organism fully sequenced, 10 years ago (make that 3 billions for humans). Quite a difference in scale. And, of course, this is one of those problems where solving twice the size requires much more than twice the time (hopefully by now, this does not surprise you, otherwise you might want to go back and read episode 1 again).

OK, let’s start:

  1. I know: one is not supposed to capitalise the name of God’s evil nemesis, but I am going on the assumption that Satan is a vindictive bastard and one can never be too prudent in courting the good graces of major players of the afterworld. []

Two months and countless draft embryos after initially promising it, here is the first part of an unfathomably long rant describing my field of research. I honestly don’t expect anybody to subject themselves to that read, but at least now I have a place to send those who foolishly ask me about it at cocktail parties.

The short answer is that I do research in Bioinformatics, which is where Mathematics (along with Computer Science and a dozen other disciplines) meet with Biology and Genetics in a dark back-alley, and do all sorts of indescribable things to each other in the hope of: creating a better world, curing cancer, breeding the next race of eugenic übermenschen or making a few bucks for Big Pharma… whichever comes first.

But that sort of answer, while technically correct, does not really tell you why such an unnatural coupling of disciplines was warranted in the first place. Allow me to start at the beginning. Way at the beginning.

[open long semi-relevant digression that can be advantageously replaced by a thorough read on Complexity Theory, if you feel up for the more sciencey and truthy version of things]

The scientific problems of this world tend to fall in either of two categories: those you might eventually solve with a good computer and some time… and those you will never solve exactly, no matter how much crazy sci-fi supercomputing power you throw at them.

This “solvable” vs. “not solvable” demarcation might sound like a tautology, until you understand the full meaning of “never” in the above statement: these, are not problems that might be solved one day, when science progresses far enough or computers get ten, a hundred or a million times faster. These are problems whose solutions require calculation of a complexity that is proven to be beyond the reach of any conventional means of computation in any foreseeable future (“unconventional means” would begin with the discovery of heretofore unknown laws of Physics: in other words, unlikely in your lifetime. at best1).

By and large, the mathematical complexity of a problem, is the order of time (or computing power) it will take to solve it, relative to its size.

Without calculating the result of a certain task, it is often possible to predict whether producing this result could or could not be done in a reasonable amount of time (where “reasonable” usually means “in less than the age of the universe, assuming the use of every single computer on earth”, or somesuch).

There are countless examples of tasks falling in the first category, “easy” tasks that can be solved quickly, regardless of how big they are. For example, anybody past kindergarten age can presumably add two numbers of practically any size with a piece of paper and a pen. You just add each digit one by one (and, yes, carry the one) and adding two 100-digit numbers will take barely more time than adding two 3-digit numbers.

Now consider a different task: say you are a traveling salesman who needs to plan their next sales route. You have a map of the region, with the towns you must visit and all the distances between them, given in kilometers. How do you find the absolute shortest route that will take you to each city at least once without wasting gas or time?

More to the point: how difficult do you think finding that route will be?

Sure, it sounds easy enough: pick a starting point, follow every roads that go from that city to another one, then onto the next etc. Keep the shortest distance you’ve found. Can’t be that tough, right?

Let’s say there are five cities: you pick a city to start from, then check all remaining four, and from each four, go onto one of the remaining three etc. etc. In total, that’s 5x4x3x2x1 = 120 different paths to compare (that product can also be written using the factorial function: n! = n x (n-1) x … x 3 x 2 x 1. e.g. 5! = 5x4x3x2x1). Not so bad.

What if there are a few more cities… for instance, two times more: 10 cities. That’s 10! = 10x9x8x7x6x5x4x3x2x1 = 3,628,800 paths to look at. Huh, that might take a bit longer to do by hand. No worries: somebody will write a computer program that gives you the answer in a couple seconds.

Except that, you guessed it, each time I double the number of cities, the difficulty does way more than just double.

For 20 cities, the number of paths to look at is: 20! = 2,432,902,008,176,640,000.

For 70, cities, there are 70! (that’s factorial of 70: 70x69x68x…x3x2x1) possible paths to check one by one. That number has exactly 100 digits. This is (very) roughly the number of particles in the entire universe. Assuming you were to put every single computer in the world to work on this, you likely would not be done by the time the Sun explodes.

  1. Yea, yea, I know… “Quantum Computing”. Let’s wait and see when we get there, ok? []

When my advisors (a combination of current and past ones) suggested that I get on the “2 weeks, 5 cities” tour, I was initially very excited.

As it turns out, however, they were not talking about an all-expenses paid tour of Asia and America’s best nightlife spots.

For mild entertainment and posterity value, a few frackload of random tidbits gleaned over the past 10 days and 25,000 miles (counting):

Boston is a nice city. Somewhat nicer than I imagined (was perhaps one of the only major US city I had never been in). At least in the middle of July, when the sun is warm and rain had apparently stopped pouring, just in time for my arrival there. But weather concerns apart, it feels like one of a rare breed of US cities, where you can live (fine) without a car. Which automatically puts it toward the top of my book. It also has lots of nice tree-lined avenues with cute little houses, and plenty of coffeeshops with semi-witty names and lovely US-style breakfasts (baaaacon…) that nearly make up for the filtered sock juice they call coffee…

Coincidentally, and with no bearing on the above statement of appreciation: Everybody in Boston is a 20-something upper-middle-class white person who only wears pastel polo shirts. Really: everybody. Even Asian people there are white. And they wear pastel polo shirts. On their way to one of the 259 Ivy League universities within walking distance of Fenway park.

I am told there are black people living in Boston too.