Category: Geek

You know you are in a Bioinformatics lab when…

Fri - 3.5.2010 by Dave·Comments Off

Lab sysadmin recently announced that, due to excessive use and lack of server space, individual user directories were now capped…

… at 4 terabytes…

… per user.

Bioinformatics Conference Pro Tip

Tue - 12.15.2009 by Dave·2 Comments

Can we all agree already to skip the “Why research on cancer is useful” introduction slide from now on?

If your talk lasts 25 minutes and goes into the minutiae of protein-protein interactions with regard to oncogenic pathways, maybe spending half of it convincing an audience of biologists and bioinformaticians that cancer is a bad thing that needs curing is not the best use of presentation time.

Kanji Stories…

Wed - 11.18.2009 by Dave·4 Comments

After keeping it on the back burner for way too long, I felt I should finally make this project public, no matter how unpolished:

KanjiStory.com is a website geared towards people studying Japanese kanji (and, I guess, to a lesser extent, Chinese… but it probably needs some tuning for that). It provides a simple yet powerful interface for people to write kanji mnemonics in the form of a simple story.

The best way to see what I mean is to go register (10 seconds, one click), read a few stories and finally: take a stab at writing your own. Allowing users to contribute stories is at least 90% of the point of this website at this stage, so please do not just go, check out the dozen sample stories and call it a day without trying the editor.

The current version of the website, although quite spiffy code-wise, needs a huge amount of work to be called a proper beta. And then, there are two million cool features just waiting to be added. However, given my very limited time resources, I figured I would first check to see how much interest (and active participation) in the project I can raise, before committing any more time working on it.

Do not hesitate to post your comments and suggestions below, but keep in mind that this is all very early-stage development and that many new features will come, once (if) this ever gets off the ground as a community project.

As an aside, if you are a reasonably experienced PHP dev with an interest in contributing to this project: get in touch (use the address: “zedrdave” at Google’s mail).

If you want to be kept informed of future KanjiStory-related news, the easiest way for now is to sign up as a fan on the Facebook page (until I set up a proper forum and RSS feed on the website).

What is Dave doing in Bioinformatics? Pt. 2

Fri - 11.6.2009 by Dave·5 Comments

And we are back on the slow crawl toward eventually explaining what I do, out here in the darker recesses of my lab tucked in the remote Kansai countryside.

Aside from breeding deadly mutant monkeys to serve in my army of evil minions when I kickstart the world-domination part of my plot, that is.

Before I go any further, let me remind the casual reader that: 1) it is most likely nice and sunny out there where you live and you would be considerably better off looking at squirrels running through the trees; 2) if you have even the slightest inkling of formal mathematical/computer science training, you will be better served forgoing this edulcorated version in favour of one of the 10 million tutorials and entries on bioinformatics available throughout the internets (Wikipedia being a good place to start). The entry written henceforth is geared toward some hypothetical grandparents who would care to know what the fuss with modern Science is all about (for instance mine, were they not already perfectly content in the sole knowledge that the good Lord has put all these tiny amino-acids together in the best possible way of all worlds and that modern genetics is the work of the Devil¹).

In last month’s episode, we laboriously learnt that Biology abounds with really, really, tough problems. Two major points were:

1. For all practical matters, NP-Complete problems are all in the same bag: finding a way to solve one efficiently would mean you can solve any other in roughly the same order of time.

2. Once you have proved that a problem is NP-Complete, trying to find an exact solution for a real-life set of data is about as meaningful as trying to take down Mount Everest with a toothpick. There are however plenty of ways to find an approximate solution. Proving NP-Completeness is your cue to start looking for approximation algorithms; and thus the fun begins.

Today, instead of going straight onto the myriad fun ways in which mathematicians solve biology problems, and which one of those I am actually connected to, another digression and an illustration everyone has heard of: genome sequencing.

Full genome sequencing (mapping the entire DNA of a given organism) is one of the earliest applications of modern bioinformatics techniques, a seminal example: it starts off as a rather straightforward bio-chemistry problem, soon runs into pesky matters of size, complexity and intractability, goes through a difficult phase of alcohol and substance abuse, but is ultimately saved by the power of Love and Mathematics.

Before I go into the gory details, allow me to dissipate a common misconception about DNA sequencing: it is nowhere near as easy as you might have been led to believe by your TV (most people’s preferred source of Science™ facts). Hearing of “DNA tests”, “DNA crime database” and other everyday-life DNA-related techniques might make it sound like sequencing is as easy as sending your saliva swab to the lab and waiting a couple days for the results. In reality, despite serious advances, actual full genome sequencing is still a multi-year, multi-million-dollar affair. When people talk about DNA in a forensics or medical context, they are usually looking at a single nucleotide base, located at a precise location on one gene, out of the entire genome. Even cases that require a larger sample of such observations (e.g. DNA matching, when it actually uses sequencing at all) are still somewhere in the lower hundreds (if that). That’s a mere 100 bases to look at, against 100+ million for the first multicellular organism fully sequenced, 10 years ago (make that 3 billion for humans). Quite a difference in scale. And, of course, this is one of those problems where solving twice the size requires much more than twice the time (hopefully by now, this does not surprise you, otherwise you might want to go back and read episode 1 again).

OK, let’s start:

What is Dave doing in Bioinformatics? Pt. 1

Fri - 10.16.2009 by Dave·7 Comments

Two months and countless draft embryos after initially promising it, here is the first part of an unfathomably long rant describing my field of research. I honestly don’t expect anybody to subject themselves to that read, but at least now I have a place to send those who foolishly ask me about it at cocktail parties.

The short answer is that I do research in Bioinformatics, which is where Mathematics, Computer Science and a dozen other disciplines meet with Biology and Genetics in a dark back-alley, and do all sorts of indescribable things to each other in the hope of: creating a better world, curing cancer, breeding the next race of eugenic übermenschen or making a few bucks for Big Pharma… whichever comes first.

But that sort of answer, while technically correct, does not really tell you why such an unnatural coupling of disciplines was warranted in the first place. Allow me to start at the beginning. Way at the beginning.

[open long semi-relevant digression that can be advantageously replaced by a thorough read on Complexity Theory, if you feel up for the more sciencey and truthy version of things]

The scientific problems of this world tend to fall into one of two categories: those you might eventually solve with a good computer and some time… and those you will never solve exactly, no matter how much crazy sci-fi supercomputing power you throw at them.

This “solvable” vs. “not solvable” demarcation might sound like a tautology, until you understand the full meaning of “never” in the above statement: these are not problems that might be solved one day, when science progresses far enough or computers get ten, a hundred or a million times faster. These are problems whose solutions require calculation of a complexity that is proven to be beyond the reach of any conventional means of computation in any foreseeable future (“unconventional means” would begin with the discovery of heretofore unknown laws of Physics: in other words, unlikely in your lifetime, at best¹).

By and large, the mathematical complexity of a problem is the order of time (or computing power) it will take to solve it, relative to its size.

Without calculating the result of a certain task, it is often possible to predict whether producing this result could or could not be done in a reasonable amount of time (where “reasonable” usually means “in less than the age of the universe, assuming the use of every single computer on earth”, or somesuch).

There are countless examples of tasks falling in the first category, “easy” tasks that can be solved quickly, regardless of how big they are. For example, anybody past kindergarten age can presumably add two numbers of practically any size with a piece of paper and a pen. You just add each digit one by one (and, yes, carry the one) and adding two 100-digit numbers will take barely more time than adding two 3-digit numbers.
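For the curious, here is a minimal sketch of that pen-and-paper addition in Python. The function name `add_digit_strings` is made up for this illustration; the point is only that the loop runs once per digit, so the work grows in step with the length of the numbers and nothing worse.

```python
def add_digit_strings(a: str, b: str) -> str:
    """Add two non-negative integers written as decimal strings, schoolbook style."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)  # pad the shorter number with leading zeros
    carry = 0
    digits = []
    # Walk from the rightmost digit to the leftmost: exactly one step per digit.
    for da, db in zip(reversed(a), reversed(b)):
        total = int(da) + int(db) + carry
        digits.append(str(total % 10))
        carry = total // 10  # "carry the one"
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))
```

Adding two 100-digit numbers this way takes about 100 trips through the loop, against 3 for two 3-digit numbers.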

Now consider a different task: say you are a traveling salesman who needs to plan their next sales route. You have a map of the region, with the towns you must visit and all the distances between them, given in kilometers. How do you find the absolute shortest route that will take you to each city at least once without wasting gas or time?

More to the point: how difficult do you think finding that route will be?

Sure, it sounds easy enough: pick a starting point, follow every road that goes from that city to another one, then onto the next etc. Keep the shortest distance you’ve found. Can’t be that tough, right?

Let’s say there are five cities: you pick a city to start from, then check all remaining four, and from each of those four, go onto one of the remaining three etc. etc. In total, that’s 5x4x3x2x1 = 120 different paths to compare (that product can also be written using the factorial function: n! = n x (n-1) x … x 3 x 2 x 1. e.g. 5! = 5x4x3x2x1). Not so bad.
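The "try everything" approach above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the function name `shortest_route` and the distance table are invented for the example, and the path is treated as one-way (no return trip to the starting city), which matches the 5x4x3x2x1 count.

```python
from itertools import permutations


def shortest_route(distances):
    """Try every ordering of the cities and keep the shortest path.

    distances[i][j] is the distance between city i and city j.
    Returns (best_path, best_length, number_of_paths_checked).
    """
    n = len(distances)
    best_path, best_length = None, None
    paths_checked = 0
    for path in permutations(range(n)):
        paths_checked += 1
        # Sum the distance of each consecutive leg of this path.
        length = sum(distances[a][b] for a, b in zip(path, path[1:]))
        if best_length is None or length < best_length:
            best_path, best_length = path, length
    return best_path, best_length, paths_checked


# Hypothetical distances (in km) between five made-up cities.
example = [
    [0, 12, 10, 19, 8],
    [12, 0, 3, 7, 2],
    [10, 3, 0, 6, 20],
    [19, 7, 6, 0, 4],
    [8, 2, 20, 4, 0],
]
path, length, checked = shortest_route(example)
print(f"best path {path}, {length} km, after checking {checked} paths")
```

The loop body is trivial; all the trouble comes from how many times it has to run.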

What if there are a few more cities… for instance, two times more: 10 cities. That’s 10! = 10x9x8x7x6x5x4x3x2x1 = 3,628,800 paths to look at. Huh, that might take a bit longer to do by hand. No worries: somebody will write a computer program that gives you the answer in a couple seconds.

Except that, you guessed it, each time I double the number of cities, the difficulty does way more than just double.

For 20 cities, the number of paths to look at is: 20! = 2,432,902,008,176,640,000.

For 70 cities, there are 70! (that’s factorial of 70: 70x69x68x…x3x2x1) possible paths to check one by one. That number has 101 digits. This is far more than the estimated number of atoms in the observable universe (a number with a mere 80 digits or so). Assuming you were to put every single computer in the world to work on this, you likely would not be done by the time the Sun explodes.
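If you would rather not take my word for it, the growth is easy to check with Python's standard `math.factorial`. The helper name `paths_to_check` is just a label chosen for this sketch.

```python
from math import factorial


def paths_to_check(cities: int) -> int:
    """Number of orderings a brute-force search has to compare for `cities` cities."""
    return factorial(cities)


for cities in (5, 10, 20, 70):
    count = paths_to_check(cities)
    # Print the size of the number rather than the number itself.
    print(f"{cities} cities: a {len(str(count))}-digit number of paths")
```

Python integers have no fixed size limit, so the count for 70 cities is computed exactly rather than approximated.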

On being a traveling salesman of science…

Mon - 8.3.2009 by Dave·Comments Off

When my advisors (a combination of current and past ones) suggested that I get on the “2 weeks, 5 cities” tour, I was initially very excited.

As it turns out, however, they were not talking about an all-expenses paid tour of Asia and America’s best nightlife spots.

For mild entertainment and posterity value, a ~~few~~ frackload of random tidbits gleaned over the past 10 days and 25,000 miles (counting):

Boston is a nice city. Somewhat nicer than I imagined (it was perhaps one of the only major US cities I had never been to). At least in the middle of July, when the sun is warm and the rain had apparently stopped pouring, just in time for my arrival there. But weather concerns apart, it feels like one of a rare breed of US cities, where you can live (fine) without a car. Which automatically puts it toward the top of my book. It also has lots of nice tree-lined avenues with cute little houses, and plenty of coffeeshops with semi-witty names and lovely US-style breakfasts (baaaacon…) that nearly make up for the filtered sock juice they call coffee…

Coincidentally, and with no bearing on the above statement of appreciation: Everybody in Boston is a 20-something upper-middle-class white person who only wears pastel polo shirts. Really: everybody. Even Asian people there are white. And they wear pastel polo shirts. On their way to one of the 259 Ivy League universities within walking distance of Fenway Park.

I am told there are black people living in Boston too.

Study Kanji on your iPhone

Wed - 7.15.2009 by Dave·1 Comment

Taking a small break from my break to introduce:

KanjiBox for iPhone

By far the best way to spend your Summer while improving your Japanese (whether at the beach or on a crowded Tokyo subway, wedged between two sweaty salarymen). This application works on all iPhone and iPod Touch models (provided they run iPhone OS 3.0 or later) and is entirely offline (doesn’t use any internet connection at all).

More info and screenshots on KanjiBox’s website…

And of course, for our iPod-deprived brethren, there is always KanjiBox for Facebook: free and full of fun multiplayer goodness!

Abstract Art or Science?

Tue - 6.9.2009 by Dave·3 Comments

You decide…

Insulin I – by Rattus Norvegicus

Save the fastest running gene in the galaxy!

Tue - 3.31.2009 by Dave·Comments Off

If, like me, you delight in advance at the possibility of one day being diagnosed with an incurable disease linked to a gene named after the world’s most famous blue hedgehog, feel free to circulate the following petition:

Dear fellows at the HUGO Gene Nomenclature Guidelines Committee,

It has recently come to my attention that you have decided to do away with names deemed “inappropriate or offensive” found in the existing international gene nomenclature.

I couldn’t agree more: I always thought that the officially registered name for gene kill-all-the-Jews-and-drink-their-blood was a bit politicised for a scientific setting.

Whatever you do, however, keep your hands off gene SHH, otherwise officially known as sonic hedgehog homolog (Drosophila) gene. We like it the way it is (blue, spiky and running very fast). Besides, what better way to break the news of some potentially fatal gene mutation disease to a kid, than by introducing a beloved computer game mascot!

PS: and for chrissake, drop the Comic Sans font: it makes your world-class gene database website look like it was coded by a 1st year CS student in 1991 (yes, I know: it probably was).

Love,

Can’t help staring…

Wed - 3.11.2009 by Dave·Comments Off

Since about age 5, I have come to grips with the notion that staring at an oven timer doesn’t make it cook a cake faster.

And yet, I still seem to think that staring at the live output of my painstakingly slow linear optimization program is gonna make it spit its final value faster.

Dave’s Logs

“They’re applauding the DJ, the medium, the beatification of the beat. This is the moment when even the white man starts dancing…”