<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Dave&#039;s Blog &#187; Bio-info</title>
	<atom:link href="http://unknowngenius.com/blog/archives/category/geek/bio-info/feed/" rel="self" type="application/rss+xml" />
	<link>http://unknowngenius.com/blog</link>
	<description>Chemically-enhanced neural rewiring, on a semi-regular basis...</description>
	<lastBuildDate>Thu, 29 Jul 2010 01:06:24 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>You know you are in a Bioinformatics lab when&#8230;</title>
		<link>http://unknowngenius.com/blog/archives/2010/03/05/you-know-you-are-in-a-bioinformatics-lab-when/</link>
		<comments>http://unknowngenius.com/blog/archives/2010/03/05/you-know-you-are-in-a-bioinformatics-lab-when/#comments</comments>
		<pubDate>Thu, 04 Mar 2010 19:43:03 +0000</pubDate>
		<dc:creator>Dave</dc:creator>
				<category><![CDATA[Bio-info]]></category>
		<category><![CDATA[Geek]]></category>
		<category><![CDATA[Quickies]]></category>

		<guid isPermaLink="false">http://unknowngenius.com/blog/?p=2839</guid>
		<description><![CDATA[Lab sysadmin recently announced that, due to excessive use and lack of server space, individual user directories were now capped&#8230; &#8230; at 4 terabytes&#8230; &#8230; per user.<p>Post originally published on: <a href="http://unknowngenius.com/blog">Dave's Blog</a> (please leave your comments over there)<br/><br/><a href="http://unknowngenius.com/blog/archives/2010/03/05/you-know-you-are-in-a-bioinformatics-lab-when/">You know you are in a Bioinformatics lab when&#8230;</a></p>
]]></description>
			<content:encoded><![CDATA[<p>Lab sysadmin recently announced that, due to excessive use and lack of server space, individual user directories were now capped&#8230; </p>
<p>&#8230; at 4 <strong><a href="http://en.wikipedia.org/wiki/Terabyte"><em>terabytes</em></a></strong>&#8230;</p>
<p>&#8230; <em>per user</em>.</p>
]]></content:encoded>
			<wfw:commentRss>http://unknowngenius.com/blog/archives/2010/03/05/you-know-you-are-in-a-bioinformatics-lab-when/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Bioinformatics Conference Pro Tip</title>
		<link>http://unknowngenius.com/blog/archives/2009/12/15/bioinformatics-conference-pro-tip/</link>
		<comments>http://unknowngenius.com/blog/archives/2009/12/15/bioinformatics-conference-pro-tip/#comments</comments>
		<pubDate>Tue, 15 Dec 2009 02:54:44 +0000</pubDate>
		<dc:creator>Dave</dc:creator>
				<category><![CDATA[Bio-info]]></category>
		<category><![CDATA[Quickies]]></category>

		<guid isPermaLink="false">http://unknowngenius.com/blog/?p=2774</guid>
		<description><![CDATA[Can we all agree already to skip the &#8220;Why research on cancer is useful&#8221; introduction slide from now on? If your talk lasts 25 minutes and goes into the minutiae of protein-protein interactions with regard to oncogenic pathways, maybe spending half of it convincing an audience of biologists and bioinformaticians that cancer is a bad [...]
]]></description>
			<content:encoded><![CDATA[<p>Can we all agree already to skip the &#8220;Why research on cancer is useful&#8221; introduction slide from now on?</p>
<p>If your talk lasts 25 minutes and goes into the minutiae of protein-protein interactions with regard to oncogenic pathways, maybe spending half of it convincing an audience of biologists and bioinformaticians that cancer is a bad thing that needs curing is not the best use of presentation time.</p>
]]></content:encoded>
			<wfw:commentRss>http://unknowngenius.com/blog/archives/2009/12/15/bioinformatics-conference-pro-tip/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>What is Dave doing in Bioinformatics? Pt. 2</title>
		<link>http://unknowngenius.com/blog/archives/2009/11/06/what-is-dave-doing-in-bioinformatics-pt-2/</link>
		<comments>http://unknowngenius.com/blog/archives/2009/11/06/what-is-dave-doing-in-bioinformatics-pt-2/#comments</comments>
		<pubDate>Fri, 06 Nov 2009 13:27:58 +0000</pubDate>
		<dc:creator>Dave</dc:creator>
				<category><![CDATA[Academic Adventures]]></category>
		<category><![CDATA[Bio-info]]></category>

		<guid isPermaLink="false">http://unknowngenius.com/blog/?p=2641</guid>
		<description><![CDATA[And we are back on the slow crawl toward eventually explaining what I do, out here in the darker recesses of my lab tucked in the remote Kansai countryside. Aside from breeding deadly mutant monkeys to serve in my army of evil minions when I kickstart the world-domination part of my plot, that is. Before [...]
]]></description>
			<content:encoded><![CDATA[<p>And we are back on the slow crawl toward eventually explaining what I do, out here in the darker recesses of my lab tucked in the remote Kansai countryside.</p>
<p>Aside from breeding deadly mutant monkeys to serve in my army of evil minions when I kickstart the world-domination part of my plot, that is.</p>
<p>Before I go any further, let me remind the casual reader that: 1) it is most likely nice and sunny out there where you live and you would be considerably better off looking at squirrels running through the trees 2) if you have even the slightest inkling of formal mathematical/computer science training, you will be better served forgoing this edulcorated version in favour of one of the 10 million tutorials and entries on bioinformatics available throughout the internets (Wikipedia being a good place to start). The entry written henceforth is geared toward some hypothetical grandparents who would care to know what the fuss with modern Science is all about (for instance mine, were they not already perfectly content in the sole knowledge that the good Lord has put all these tiny amino-acids together in the best possible way of all worlds and that modern genetics are the work of the Devil<sup>1</sup>).</p>
<p>In <a href="http://unknowngenius.com/blog/archives/2009/10/16/what-is-dave-doing-in-bioinformatics-pt-1/">last month&#8217;s episode</a>, we laboriously learnt that Biology abounds with <a href="http://en.wikipedia.org/wiki/NP-complete#NP-complete_problems">really, <em>really</em>, tough problems</a>. Two major points were:</p>
<p>1. For all practical matters, NP-Complete problems are all in the same bag: finding a way to solve one efficiently would mean you can solve any other in roughly the same order of time.</p>
<p>2. Once you have proved that a problem is NP-Complete, trying to find an exact solution for a real-life set of data is about as meaningful as trying to take down Everest with a toothpick. There are however plenty of ways to find an <em>approximate</em> solution. Proving NP-Completeness is your cue to start looking for approximation algorithms; and thus the fun begins.</p>
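<p>For the code-inclined, here is a minimal sketch (in Python, an arbitrary choice) of what &#8220;approximate&#8221; can mean on a route-planning problem: always hop to the closest town not yet visited. The name <code>nearest_neighbour_route</code> and the toy distance table are made up for illustration, and this is merely one of the simplest heuristics around, not a state-of-the-art method.</p>

```python
def nearest_neighbour_route(distance, start=0):
    """Greedy heuristic: from each town, go to the closest unvisited one.

    `distance[a][b]` is the distance from town a to town b.
    Fast (roughly n*n steps) but with no guarantee of finding the best route.
    """
    unvisited = set(range(len(distance))) - {start}
    route = [start]
    while unvisited:
        here = route[-1]
        # Ties are broken by town number so the result is reproducible.
        closest = min(unvisited, key=lambda town: (distance[here][town], town))
        route.append(closest)
        unvisited.remove(closest)
    return route


def route_length(route, distance):
    return sum(distance[a][b] for a, b in zip(route, route[1:]))


# Hypothetical example: four towns along a straight road.
positions = [0, 2, 5, 9]
distance = [[abs(a - b) for b in positions] for a in positions]

for start in (0, 1):
    route = nearest_neighbour_route(distance, start)
    print(start, route, route_length(route, distance))
```

<p>Starting from town 0, the heuristic happens to find the best possible route (length 9); starting from town 1, it settles for a route of length 11. Close enough for many purposes, and obtained without looking at every possible ordering.</p>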
<p>Today, instead of going straight onto the myriad fun ways in which mathematicians solve biology problems, and which one of those I am actually connected to, another digression and an illustration everyone has heard of: <a href="http://en.wikipedia.org/wiki/Full_genome_sequencing">genome sequencing</a>.</p>
<p>Full genome sequencing (mapping the entire DNA of a given organism) is one of the earliest applications of modern bioinformatics techniques, a seminal example: it starts off as a rather straightforward bio-chemistry problem, soon runs into pesky matters of size, complexity and <a href="http://en.wikipedia.org/wiki/Computational_complexity_theory#Intractability">intractability</a>, goes through a difficult phase of alcohol and substance abuse, but is ultimately saved by the power of Love and Mathematics. </p>
<p>Before I go into the gory details, allow me to dispel a common misconception about DNA sequencing: it is nowhere near as easy as you might have been led to believe by your TV (most people&#8217;s preferred source of Science™ facts). Hearing of &#8220;DNA tests&#8221;, &#8220;DNA crime database&#8221; and other everyday life DNA-related techniques might make it sound like sequencing is as easy as sending your saliva swab to the lab and waiting a couple days for the results. In reality, despite <a href="http://en.wikipedia.org/wiki/Full_genome_sequencing#Race_to_commercialization">serious advances</a>, actual full genome sequencing is still a multi-year, multi-million-dollar affair. When people talk about DNA in a forensics or medical context, they are usually looking at a <a href="http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism"><em>single</em> base nucleotide</a>, located at a precise location on <em>one</em> gene, out of the entire genome. Even cases that require a larger sample of such observations (<i>e.g.</i> DNA matching, when it actually uses sequencing at all) are still somewhere in the lower hundreds (if that). That&#8217;s a mere 100 <a href="http://en.wikipedia.org/wiki/Nucleobase">bases</a> to look at, against 100+ million for the first <a href="http://en.wikipedia.org/wiki/C_elegans#Genome">multicellular organism</a> fully sequenced, 10 years ago (make that 3 billion for <a href="http://en.wikipedia.org/wiki/Human_genome">humans</a>). Quite a difference in scale. And, of course, this is one of those problems where solving twice the size requires <em>much</em> more than twice the time (hopefully by now, this does not surprise you, otherwise you might want to go back and read <a href="http://unknowngenius.com/blog/archives/2009/10/16/what-is-dave-doing-in-bioinformatics-pt-1/">episode 1</a> again).</p>
<p>OK, let&#8217;s start:</p>
<p><span id="more-2641"></span><br />
<h3>The Easy Part</h3>
<p>Long before full-genome sequencing arose, there were already many efficient methods to map <em>small</em> fragments of DNA. How small depends a lot on how efficient, how fast and how reliable you want the result, but 100 bases is a good figure to go with (could be 100 or 1000: it would not matter much when compared to the aforementioned billions of bases of the human genome). The <a href="http://en.wikipedia.org/wiki/DNA_sequencing#Chain-termination_methods">methods used to do small-scale sequencing</a> are quite nifty, but since they merely rely on boring<sup>2</sup> bio-chemistry stuff, I will let you google around (Wikipedia&#8217;s entry on the topic is not the greatest introduction, for once). For our sake, let&#8217;s just say we have biological tools that can map segments of DNA of 100 bases (or smaller) in a fairly short time with near-perfect accuracy. </p>
<p>Other magical tools that we have at our disposal in our bio-chemistry toolbox allow us to somewhat painlessly:</p>
<p>1. replicate a sequence (of any length) an arbitrary number of times. </p>
<p>2. cut sequences into smaller bits at arbitrary positions: usually upon encountering a specific pattern in the sequence and with a certain frequency rate (<i>e.g.</i>: upon encountering the pattern &#8216;CCCGGG&#8217; anywhere in the sequence, the cut will occur 1 out of 6 times). </p>
<p>3. work on a subsequence at a specific position (as if it were cut from the rest): provided you have an <em>exact</em> pattern where you want to do it (and the pattern is long enough to be unique in the sequence), you can isolate a subsequence from the rest with absolute precision and accuracy. However, engineering such an artificial cut is infinitely more complex, long and costly than using some of the &#8220;ready-made&#8221; cutters of 2.</p>
<p><em>The same with more science and fewer words, you ask? Sure, here you go: <a href="http://en.wikipedia.org/wiki/Primer_(molecular_biology)">DNA primers</a>, <a href="http://en.wikipedia.org/wiki/DNA_polymerase">DNA polymerase</a>, <a href="http://en.wikipedia.org/wiki/Polymerase_chain_reaction">PCR</a>, <a href="http://en.wikipedia.org/wiki/Restriction_enzyme">restriction enzymes</a>, <a href="http://en.wikipedia.org/wiki/Kangaroo">kangaroos</a><sup>3</sup>.</em></p>
<p>For the sake of keeping things simple (ha!) we will blatantly ignore the fact that DNA is not a single strand, but <a href="http://en.wikipedia.org/wiki/DNA">two complementary sequences tied together</a> that need to be periodically pried apart or rebuilt during these operations. In the end, it does not matter much to the problem at hand.</p>
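<p>If it helps to see the three tools as plain string operations, here is a rough Python sketch. Everything in it is a simplification made up for illustration: the function names are hypothetical, DNA is treated as a single strand of letters, and the cut is arbitrarily placed right after the pattern (real cutters each have their own preferred spot).</p>

```python
import random


def replicate(sequence, copies):
    """Tool 1: make identical copies of a sequence of any length."""
    return [sequence] * copies


def cut_at_pattern(sequence, pattern, cut_probability, rng):
    """Tool 2: cut after occurrences of `pattern`, each with some probability."""
    fragments = []
    start = 0
    position = sequence.find(pattern)
    while position != -1:
        end = position + len(pattern)
        if rng.random() < cut_probability:
            fragments.append(sequence[start:end])
            start = end
        position = sequence.find(pattern, position + 1)
    fragments.append(sequence[start:])
    return [fragment for fragment in fragments if fragment]


def isolate_from(sequence, primer, length):
    """Tool 3: read `length` bases right after a primer that must be unique."""
    if sequence.count(primer) != 1:
        raise ValueError("primer must occur exactly once in the sequence")
    start = sequence.index(primer) + len(primer)
    return sequence[start:start + length]


rng = random.Random(0)          # fixed seed, for reproducibility
dna = "ATCCCGGGTTACCCGGGAT"     # made-up toy sequence
print(replicate(dna, 2))
print(cut_at_pattern(dna, "CCCGGG", 1.0, rng))   # always cut
print(cut_at_pattern(dna, "CCCGGG", 1 / 6, rng)) # cut 1 time out of 6
print(isolate_from(dna, "TTA", 6))
```

<p>With a cut probability of 1, the toy sequence always falls apart into <code>ATCCCGGG</code>, <code>TTACCCGGG</code> and <code>AT</code>; with 1 out of 6, what you get depends on the roll of the dice.</p>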
<p>Equipped with such tools, how should we go about processing entire chromosomes (usually north of 10k bases for simple organisms, 50 million and up for humans)?</p>
<h3>The Straightforward Method (aka The Doomed Approach)</h3>
<p><a href="http://en.wikipedia.org/wiki/Primer_walking">&#8220;Primer walking&#8221;</a> is a (comparatively) simple method that more or less consists of focussing on segments small enough to fit our small-scale sequencing method: start at a given position (see tool 3. above), map a few hundred bases, use the chunk just mapped to find the next starting position, lather, rinse, repeat.</p>
<p>There are two major problems with this method: as mentioned, &#8220;starting at a specific position&#8221; on a sequence is a very <i>ad hoc</i> and therefore costly (in money, in time&#8230;) operation. That operation furthermore requires knowing <em>where</em> to start, which means having processed the previous segment (so we can use the end of this segment as the pattern to start at in the following step).</p>
<p>The entire process therefore requires <em>sequentially</em> executing a series of operations, each of which requires a certain time. Surprisingly, no NP-Completeness or exponentials are involved here: just a very, very long task with no way to <a href="http://en.wikipedia.org/wiki/Parallelization">parallelize</a> it, which is only better in theory. In practice, this makes this method viable (albeit slow and costly) for sequences of up to 10,000 or so bases. </p>
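<p>To make the &#8220;one step at a time&#8221; problem a bit more tangible, here is a toy Python simulation of the walk. The function <code>primer_walk</code>, the read length of 100 and the primer length of 20 are assumptions picked for the example, and the &#8220;genome&#8221; is just a random string.</p>

```python
import random


def primer_walk(genome, read_length=100, primer_length=20):
    """Read `genome` chunk by chunk, each chunk located thanks to the
    tail end of the previous one. Returns (sequence, number_of_rounds)."""
    assembled = genome[:read_length]   # first read, from a known start
    rounds = 1
    while len(assembled) < len(genome):
        # The end of what we already know serves as the next primer...
        primer = assembled[-primer_length:]
        anchor = genome.index(primer) + len(primer)
        if anchor != len(assembled):
            raise ValueError("primer is not unique in the genome")
        # ...so this read cannot start before the previous one is done.
        assembled += genome[anchor:anchor + read_length]
        rounds += 1
    return assembled, rounds


rng = random.Random(1)   # fixed seed so the toy genome is reproducible
genome = "".join(rng.choice("ACGT") for _ in range(1000))
sequence, rounds = primer_walk(genome)
print(sequence == genome, rounds)
```

<p>The number of rounds grows in direct proportion to the length of the genome, and each round has to wait for the previous one: at 100 bases a pop, 3 billion bases would mean 30 million rounds, strictly one after the other.</p>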
<p>On longer sequences, and particularly on our friendly multi-billion genomes, that would be a <a href="http://www.youtube.com/watch?v=sD8ma6MCuYY">Fail</a>.</p>
<h3>The Smart Way (aka The One That Gives Headaches)</h3>
<p>From the above, it seems obvious the only way to stand a chance at decoding entire genomes in less than a lifetime is to tackle many small segments, possibly all of them, at the same time. And this is exactly what <a href="http://en.wikipedia.org/wiki/Shotgun_sequencing">shotgun sequencing</a> does: after making a good number of copies of the original sequence, we sic a bunch of ready-made sequence cutters on them. The resulting mixture is a horrible mess of fragments, each small enough to be sequenced easily. Which is great, except you then have no idea how to put them back together so as to get the bigger picture.</p>
<p>Not clear? Need another silly analogy? Sure:</p>
<p>Imagine you have a (very very long) text printed on a piece of paper: make a dozen photocopies and leave them in a room with your 5-year-old nephew and a good pair of scissors (arranging the pages in such a way that each is cut separately, in its own unique way). Put all the resulting confetti in a bag, shake a few times.</p>
<p>Now put the original text back together. Go ahead, I&#8217;ll wait.</p>
<p>Funnily enough, this is near-exactly what DNA sequencing entails: once the first phase of sequencing is done, you are given a large bag of short strings (long sequences of &#8216;A&#8217;, &#8216;G&#8217;, &#8216;T&#8217; and &#8216;C&#8217; instead of common words in our analogy, but the same for every other purpose). You are looking for the text that is made of these strings in a specific order (and repeated over, as many times as there were initial copies). An &#8220;easier&#8221; problem is to find <em>any</em> text on which you could fit each little fragment (with lots of overlap, since there are many copies, each cut a different way) without worrying too much about whether this text really is your original. In practice, to get the real solution, all you have to do is find the <em>shortest</em> such text<sup>4</sup>: the one where you have no &#8220;superfluous&#8221; bits of text (not covered by any fragment) and as much overlap as possible (ideally in such a way that each bit of text is covered exactly as many times as there were original copies). </p>
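<p>Here is a small Python sketch of the nephew-and-scissors experiment, for those who would rather not sacrifice actual paper. The helper names <code>shotgun</code> and <code>is_common_superstring</code> are invented for the occasion, and cutting at uniformly random positions is a simplifying assumption (real cutters look for patterns).</p>

```python
import random


def shotgun(sequence, copies, mean_fragment_length, rng):
    """Cut each copy of `sequence` at its own random positions, then
    shuffle all the resulting fragments together in one big bag."""
    fragments = []
    cut_count = min(len(sequence) - 1,
                    max(1, len(sequence) // mean_fragment_length))
    for _ in range(copies):
        cuts = sorted(rng.sample(range(1, len(sequence)), cut_count))
        bounds = [0] + cuts + [len(sequence)]
        fragments.extend(sequence[a:b] for a, b in zip(bounds, bounds[1:]))
    rng.shuffle(fragments)
    return fragments


def is_common_superstring(text, fragments):
    """True if every single fragment can be found somewhere in `text`."""
    return all(fragment in text for fragment in fragments)


rng = random.Random(0)
original = "the quick brown fox jumps over the lazy dog"
bag = shotgun(original, copies=12, mean_fragment_length=8, rng=rng)
print(len(bag), "fragments, for instance:", bag[:5])
print(is_common_superstring(original, bag))
print(is_common_superstring(original * 2, bag))
```

<p>Both checks succeed: the original text fits all the fragments, but so does the original repeated twice. Hence the need to look for the <em>shortest</em> fitting text rather than any fitting text.</p>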
<p>This problem is known as the <a href="http://www.cs.sunysb.edu/~algorith/files/shortest-common-superstring.shtml"><i>shortest common superstring</i></a>. And guess what: <a href="http://en.wikipedia.org/wiki/List_of_NP-complete_problems#Compression_and_representation">remember that little list I mentioned last time</a>? Yeah, bummer.</p>
<p>Merely deciding whether there exists a common superstring shorter than some given length is an NP-Complete problem. Finding the true Shortest Common Superstring from scratch belongs to the (at least as hard) class of problems known as &#8220;NP-Hard&#8221; (where, as the name lets on, real badass NP problem kids hang out). Given the size of your data (millions of tiny fragments), finding an exact solution is obviously not gonna happen in this lifetime. Neatly enough, solving this problem turns out to be exactly the same as solving a <a href="http://en.wikipedia.org/wiki/Travelling_salesman_problem#Asymmetric_TSP">particular brand of Traveling Salesman Problem</a> (remember?). But as I mentioned last time: where standard algorithms fail and an exact solution cannot be found, it is usually still possible to find a decent approximation in a fraction of the time. Sometimes this solution even turns out to be the exact solution, or so close that it doesn&#8217;t matter. This is the case here.</p>
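<p>As an illustration of what such an approximation can look like, here is a Python sketch of the classic greedy heuristic: keep gluing together the two fragments that overlap the most, until only one string is left. This is a textbook approach, not necessarily what actual sequencing software does, and the names <code>overlap</code> and <code>greedy_superstring</code> are made up for the example.</p>

```python
def overlap(left, right):
    """Length of the longest end of `left` that is also a start of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_superstring(fragments):
    """Approximate the shortest common superstring by greedy merging."""
    # Drop duplicates and fragments entirely contained in another one.
    pieces = sorted(set(fragments))
    pieces = [f for f in pieces
              if not any(f != g and f in g for g in pieces)]
    while len(pieces) > 1:
        # Pick the ordered pair (i, j) with the largest overlap.
        size, i, j = max((overlap(pieces[i], pieces[j]), i, j)
                         for i in range(len(pieces))
                         for j in range(len(pieces)) if i != j)
        merged = pieces[i] + pieces[j][size:]
        pieces = [p for k, p in enumerate(pieces) if k not in (i, j)]
        pieces.append(merged)
    return pieces[0] if pieces else ""


# Three overlapping fragments of the made-up text ATTAGACCTGCCGGAATAC
fragments = ["TGCCGGAATAC", "ATTAGACCTG", "GACCTGCCGG"]
print(greedy_superstring(fragments))  # prints ATTAGACCTGCCGGAATAC
```

<p>On this tiny example the greedy answer is exactly the original text. On real data it comes with no such promise, which is where repeats (see footnote 4) and cleverer algorithms enter the picture.</p>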
<p><i>And excited as I am sure you all were, to jump into the actual mathematics of solving this particular problem, a sense of human decency and the already astronomical word count on this entry conjointly force me to skip to the end and suggest that the <s>morbidly obsessed</s> bravest among you peruse their local Google to look for some fascinating readings on <a href="http://www.springerlink.com/content/y354x7513025843m/">reducing Shortest Common Superstring to Asymmetric Traveling Salesman Problem</a> and/or <a href="http://scholar.google.com/scholar?cluster=1725645684324835570&#038;hl=en">finding approximation algorithms for the Shortest Common Superstring problem</a>.</i></p>
<p>One of the tricks behind Shotgun Sequencing, besides the actual approximation algorithm used to rebuild the original sequence, is the clever use of statistical redundancy to reach a guaranteed level of accuracy. That&#8217;s how we are able to say with certainty that the accuracy for the current Human Genome project is &#8220;no more than 1 error every 100,000 bases on average&#8221; (which sounds an awful lot better than &#8220;fuck if we know, but we dang sure hope it&#8217;s enough to tell it apart from them monkey DNA&#8221;).</p>
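<p>The simplest form of that redundancy trick is a plain majority vote: if the same stretch has been read several times, an occasional misread base gets outvoted by the other copies. Below is a minimal Python sketch, assuming (generously) that the reads are already lined up and of equal length; the function name <code>consensus</code> is hypothetical, and real pipelines use considerably more refined statistics.</p>

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority vote, column by column, over reads of the same region."""
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*aligned_reads))


# Five reads of the same made-up region, three of them with one error each.
reads = [
    "ACGTACGT",
    "ACGTTCGT",   # error at position 4
    "ACCTACGT",   # error at position 2
    "ACGTACGA",   # error at position 7
    "ACGTACGT",
]
print(consensus(reads))  # prints ACGTACGT
```

<p>The more copies cover a given position, the less likely it is that errors win the vote there, which is how one gets to quote an error rate rather than cross fingers.</p>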
<h3>The thankfully brief conclusion</h3>
<p>And there we are, I have exposed, in nearly less time than it takes to kill a grizzly with a feather<sup>5</sup>, how Mathematics and Algorithmics can save the day in Genetics and help ensure that we will all have a bright future made of beautiful blond, three-eyed babies.</p>
<p>Of course, given that Shotgun Sequencing in its earliest incarnation dates back about 30 years, there is very little left to be done for young upcoming bioinformaticians in this particular direction. And none of my work has anything to do with common sequencing problems.</p>
<p>But next <s>week</s> <s>month</s> soon, I will maybe talk of more recent challenges in bioinformatics, perhaps even some of mine!<sup>6</sup></p>
<ol class="footnotes"><li id="footnote_0_2641" class="footnote">I know: one is not supposed to capitalise the name of God&#8217;s evil nemesis, but I am going on the assumption that Satan is a vindictive bastard and one can never be too prudent in courting the good graces of major players of the afterworld.</li><li id="footnote_1_2641" class="footnote">just kidding, fellow biologist friends: keep looking after those test tubes and churning out my data. kthxbye.</li><li id="footnote_2_2641" class="footnote">Have fun with your friends: can you spot the famous Australian marsupial among these terms related to DNA engineering?</li><li id="footnote_3_2641" class="footnote">the only reason this is not 100% true is the pesky problem of highly repetitive sequences. Something unlikely to arise in a real book, but very common in some DNA.</li><li id="footnote_4_2641" class="footnote">Pro tip: go for that spot under their foot, grizzlies are extremely ticklish there.</li><li id="footnote_5_2641" class="footnote">&#8230; in part 205 to 300 of this entrancing 1400-part series!</li></ol>
]]></content:encoded>
			<wfw:commentRss>http://unknowngenius.com/blog/archives/2009/11/06/what-is-dave-doing-in-bioinformatics-pt-2/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>What is Dave doing in Bioinformatics? Pt. 1</title>
		<link>http://unknowngenius.com/blog/archives/2009/10/16/what-is-dave-doing-in-bioinformatics-pt-1/</link>
		<comments>http://unknowngenius.com/blog/archives/2009/10/16/what-is-dave-doing-in-bioinformatics-pt-1/#comments</comments>
		<pubDate>Thu, 15 Oct 2009 23:32:37 +0000</pubDate>
		<dc:creator>Dave</dc:creator>
				<category><![CDATA[Academic Adventures]]></category>
		<category><![CDATA[Bio-info]]></category>

		<guid isPermaLink="false">http://unknowngenius.com/blog/?p=2582</guid>
		<description><![CDATA[Two months and countless draft embryos after initially promising it, here is the first part of an unfathomably long rant describing my field of research. I honestly don&#8217;t expect anybody to subject themselves to that read, but at least now I have a place to send those who foolishly ask me about it at cocktail [...]
]]></description>
			<content:encoded><![CDATA[<p>Two months and countless draft embryos after initially promising it, here is the first part of an unfathomably long rant describing my field of research. I honestly don&#8217;t expect anybody to subject themselves to that read, but at least now I have a place to send those who foolishly ask me about it at cocktail parties.</p>
<p>The short answer is that I do research in <a href="http://en.wikipedia.org/wiki/Bioinformatics">Bioinformatics</a>, which is where Mathematics (along with Computer Science and a dozen other disciplines) meets with Biology and Genetics in a dark back-alley, and they do all sorts of indescribable things to each other in the hope of: creating a better world, curing cancer, breeding the next race of eugenic übermenschen or making a few bucks for Big Pharma&#8230; whichever comes first.</p>
<p>But that sort of answer, while technically correct, does not really tell you why such an unnatural coupling of disciplines was warranted in the first place. Allow me to start at the beginning. <em>Way</em> at the beginning.</p>
<p><strong>[open long semi-relevant digression that can be advantageously replaced by a thorough read on <a href="http://en.wikipedia.org/wiki/Computational_complexity_theory">Complexity Theory</a>, if you feel up for the more sciencey and truthy version of things]</strong></p>
<p>The scientific problems of this world tend to fall into one of two categories: those you might eventually solve with a good computer and some time&#8230; and those you will <em>never</em> solve exactly, no matter how much crazy sci-fi supercomputing power you throw at them.</p>
<p>This &#8220;solvable&#8221; vs. &#8220;not solvable&#8221; demarcation might sound like a tautology, until you understand the full meaning of &#8220;never&#8221; in the above statement: these are not problems that might be solved one day, when science progresses far enough or computers get ten, a hundred or a million times faster. These are problems whose solutions require calculations of a complexity that is, as far as anybody can tell, beyond the reach of any conventional means of computation in any foreseeable future (strictly speaking, this has not been formally <em>proven</em> for the problems discussed below: it is the famous, still open, &#8220;P vs. NP&#8221; question, but practically nobody expects it to turn out otherwise; and &#8220;unconventional means&#8221; would begin with the discovery of heretofore unknown laws of Physics: in other words, unlikely in your lifetime, at best).</p>
<p>By and large, the <a href="http://en.wikipedia.org/wiki/Computational_complexity_theory">mathematical complexity</a> of a problem is the order of time (or computing power) it will take to solve it, relative to its size. </p>
<p>Without calculating the result of a certain task, it is often possible to predict whether producing this result could or could not be done in a reasonable amount of time (where &#8220;reasonable&#8221; usually means &#8220;in less than the age of the universe, assuming the use of every single computer on earth&#8221;, or somesuch).</p>
<p>There are countless examples of tasks falling in the first category, &#8220;easy&#8221; tasks that can be solved quickly, regardless of how big they are. For example, anybody past kindergarten age can presumably add two numbers of practically any size with a piece of paper and a pen. You just add each digit one by one (and, yes, carry the one) and adding two 100-digit numbers will take barely more time than adding two 3-digit numbers.</p>
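<p>For the record, the pen-and-paper method translates into a few lines of Python (the function name <code>add_digit_strings</code> is made up; Python can of course add big numbers on its own, the point is only to show that the work is one small step per digit):</p>

```python
def add_digit_strings(a, b):
    """Schoolbook addition of two non-negative numbers given as strings."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)   # pad the shorter one with zeros
    digits, carry = [], 0
    # One pass from the rightmost digit to the leftmost: `width` steps.
    for x, y in zip(reversed(a), reversed(b)):
        carry, digit = divmod(int(x) + int(y) + carry, 10)
        digits.append(str(digit))
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))


print(add_digit_strings("999", "1"))          # prints 1000
print(add_digit_strings("9" * 100, "1"))      # a 1 followed by 100 zeros
```

<p>Adding two 100-digit numbers takes 100 turns of the loop, two 3-digit numbers take 3: the effort grows in step with the size of the numbers, and nothing worse.</p>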
<p>Now consider a different task: say you are <a href="http://en.wikipedia.org/wiki/Travelling_salesman_problem">a traveling salesman who needs to plan their next sales route</a>. You have a map of the region, with the towns you must visit and all the distances between them, given in kilometers. How do you find the absolute shortest route that will take you to each city at least once without wasting gas or time?</p>
<p>More to the point: how <em>difficult</em> do you think <a href="http://www.tsp.gatech.edu/games/tspOnePlayer.html">finding that route will be</a>?</p>
<p>Sure, it sounds easy enough: pick a starting point, follow every road that goes from that city to another one, then onto the next etc. Keep the shortest distance you&#8217;ve found. Can&#8217;t be that tough, right?</p>
<p>Let&#8217;s say there are five cities: you pick a city to start from, then check all remaining four, and from each of those four, go onto one of the remaining three etc. etc. In total, that&#8217;s 5x4x3x2x1 = 120 different paths to compare (that product can also be written using the <a href="http://en.wikipedia.org/wiki/Factorial">factorial</a> function: <i>n! = n x (n-1) x &#8230; x 3 x 2 x 1</i>. e.g. <i>5! = 5x4x3x2x1</i>). Not so bad.</p>
<p>What if there are a few more cities&#8230; for instance, two times more: 10 cities. That&#8217;s <i>10! = 10x9x8x7x6x5x4x3x2x1 = 3,628,800</i> paths to look at. Huh, that might take a bit longer to do by hand. No worries: somebody will write a computer program that gives you the answer in a couple seconds.</p>
<p>Except that, you guessed it, each time I double the number of cities, the difficulty does way more than just double. </p>
<p>For 20 cities, the number of paths to look at is: 20! = 2,432,902,008,176,640,000.</p>
<p>For 70 cities, there are 70! (that&#8217;s <em>factorial</em> of 70: 70x69x68x&#8230;x3x2x1) possible paths to check one by one. That number has 101 digits. This is vastly more than the number of particles in the entire observable universe (a paltry 80-digit or so number). Assuming you were to put every single computer in the world to work on this, you likely would not be done by the time the Sun explodes.</p>
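<p>Should you wish to check those numbers (or watch your own computer give up), here is a short Python sketch of the look-at-every-path method. The function names and the four-town example are invented for illustration; the counting part relies only on the standard <code>math.factorial</code>.</p>

```python
from itertools import permutations
from math import factorial


def route_length(route, distance):
    return sum(distance[a][b] for a, b in zip(route, route[1:]))


def shortest_route_brute_force(distance):
    """Try every possible ordering of the cities: n! of them."""
    cities = range(len(distance))
    return min(permutations(cities),
               key=lambda route: route_length(route, distance))


# Hypothetical example: four towns along a straight road.
positions = [0, 2, 5, 9]
distance = [[abs(a - b) for b in positions] for a in positions]
best = shortest_route_brute_force(distance)
print(best, route_length(best, distance))

# How many orderings would have to be checked as the map grows:
for n in (5, 10, 20, 70):
    print(n, "cities:", len(str(factorial(n))), "digits in n!")
```

<p>Four towns are dealt with instantly. Do not try this with 20.</p>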
<p><span id="more-2582"></span><br />
Of course, you might suspect there are better ways to do this than looking at every possible path one by one. <a href="http://en.wikipedia.org/wiki/Travelling_salesman_problem#Exact_algorithms">There are</a>. But they are neither easy, nor fundamentally better. The best exact method known to date still requires in the order of <i>2<small><sup>n</sup></small></i> operations, give or take a small factor (where <i>n</i> is the total number of cities), and nobody seriously expects a radically faster one to turn up. Although it is smaller than <i>n!</i>, <i>2<small><sup>n</sup></small></i> is still a ludicrously huge number, for a large-enough number of cities (&#8220;n&#8221;). In practice, this puts the <a href="http://en.wikipedia.org/wiki/Travelling_salesman_problem">Traveling Salesman Problem</a> in the &#8220;unsolvable&#8221; category: by putting enough cities on a map, one can always ensure that no amount of raw computing power will solve it exactly in a reasonable amount of time, using any method known today.</p>
<p>Another common example is the <a href="http://en.wikipedia.org/wiki/Knapsack_problem">Knapsack Problem</a>: given a knapsack that can only be filled up to a certain maximal weight, and given a set of items, each with a specific weight and value, how difficult is it to pick the combination of items that will have the highest value, while remaining under the total weight limit? To the despair of jewelry thieves the world over, this problem is equally impossible to solve easily: as soon as the number of items to choose from gets big enough, the number of possible arrangements to look at shoots through the roof (and it is very hard to carry your own supercomputer while robbing a jewelry store).</p>
<p>These two problems (and <a href="http://en.wikipedia.org/wiki/List_of_NP-complete_problems">many others</a>) belong to a class of problems known as <a href="http://en.wikipedia.org/wiki/NP-complete">NP-Complete</a>. The only two things you might ever want to know about NP-Completeness and NP-Complete problems (beside the fact that unicorns and kittens are considerably more fun to look at) are that: </p>
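<p>One well-known example of such a better-but-still-hopeless method is the Held-Karp dynamic programming algorithm, which remembers the best way to cover each <em>set</em> of cities instead of re-walking every ordering. Below is a Python sketch, adapted (my assumption, to match the formulation above) to look for a path with free start and end rather than a round trip; the function name is made up.</p>

```python
from itertools import combinations


def held_karp_path(distance):
    """Length of the shortest path visiting every city exactly once
    (any start, any end), in roughly n * n * 2**n steps instead of n!."""
    n = len(distance)
    # best[(visited, last)]: shortest path covering the set `visited`
    # and finishing in city `last`.
    best = {(frozenset([city]), city): 0 for city in range(n)}
    for size in range(2, n + 1):
        for subset in combinations(range(n), size):
            visited = frozenset(subset)
            for last in subset:
                rest = visited - {last}
                best[(visited, last)] = min(
                    best[(rest, prev)] + distance[prev][last]
                    for prev in rest)
    everything = frozenset(range(n))
    return min(best[(everything, last)] for last in range(n))


# Hypothetical example: four towns along a straight road.
positions = [0, 2, 5, 9]
distance = [[abs(a - b) for b in positions] for a in positions]
print(held_karp_path(distance))  # prints 9
```

<p>The table <code>best</code> holds one entry per (set of cities, last city) pair, hence the <i>2<small><sup>n</sup></small></i>: a huge improvement over <i>n!</i>, and still entirely useless for a map with a few hundred cities.</p>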
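<p>For aspiring jewelry thieves, the naive try-every-combination approach looks like this in Python. The items, weights and values are entirely fictional, and <code>best_loot</code> is a hypothetical name; the only point is that the loop goes through all <i>2<small><sup>n</sup></small></i> possible selections.</p>

```python
from itertools import combinations


def best_loot(items, max_weight):
    """items: list of (name, weight, value) tuples.
    Tries every possible selection of items: 2**n of them."""
    best_value, best_choice = 0, ()
    for size in range(len(items) + 1):
        for choice in combinations(items, size):
            weight = sum(w for _, w, _ in choice)
            value = sum(v for _, _, v in choice)
            if weight <= max_weight and value > best_value:
                best_value, best_choice = value, choice
    return best_value, [name for name, _, _ in best_choice]


shop = [("ring", 1, 15), ("watch", 2, 20), ("tiara", 3, 30), ("vase", 5, 10)]
print(best_loot(shop, max_weight=5))  # prints (50, ['watch', 'tiara'])
```

<p>Four items mean 16 selections to check; forty items mean over a trillion.</p>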
<ul>
<li>NP-Complete problems are not just problems that &#8220;seem&#8221; tough until somebody figures out a better way to do them. They are tough. Mofo tough. Nobody has managed to formally prove that no shortcut exists (that is the famous, still open, &#8220;P vs. NP&#8221; question), but decades of failed attempts have convinced nearly everyone that there is none. No matter how fast computers keep getting, they will always fall short of solving this type of problem exactly, for big enough data.</li>
<li>All NP-Complete problems are interconnected. In fact, this is how you usually prove that a problem is NP-Complete: by proving that solving it is at least as difficult as solving another NP-Complete problem. If by a mathematically implausible stroke of genius you found an easy way to solve one NP-Complete problem, you would have found a way to solve them all.</li>
</ul>
<p>Most importantly: NP-Complete problems are everywhere. </p>
<p>If the plight of traveling salesmen and jewelry robbers throughout the world does not seem to justify wasting your brain cells, keep in mind that their family-friendly formulation is a mere varnish on top of some very ugly raw math definitions that make them the standard building blocks for modeling countless science and everyday-life problems.</p>
<p><strong>[end digression]</strong></p>
<p>Which finally brings us to Bioinformatics. By way of Biology: Biology is <em>full</em> of hard problems. And by hard, I do not mean &#8220;take a long time, need a big microscope and a big computer&#8221;, I mean problems that somehow tie to NP-Complete problems. You could say that a lot of things in the real world end up leading to NP-Complete problems, but there are two particularities that make biology/genetics problems even more likely to:</p>
<p>1. Biological things love <a href="http://en.wikipedia.org/wiki/Exponential_growth">exponential growth</a>. Pretty much every Living Thing in Nature grows exponentially. This tends to greatly affect problems related to the science of Living Things, otherwise known as Biology.</p>
<p>2. In order to be useful, biological problems have to be solved on humongous data sets: whether it is <a href="http://en.wikipedia.org/wiki/DNA">sequences of billions of nucleotides</a>, <a href="http://en.wikipedia.org/wiki/Metabolic_pathway">graphs with thousands of vertices and edges</a> or <a href="http://en.wikipedia.org/wiki/DNA_microarray">experimental results involving thousands of genes at once</a>&#8230;</p>
<p>If you read the digression above and throw in these two points, you might start seeing why biologists have started running into lots of walls over the past 20-30 years&#8230; The universal approach to solving Life Science problems (buying bigger computers or waiting a tad longer for them to pop out the results) no longer works. Identifying protein similarities in a DNA sequence database is not just a needle-and-haystack problem, it is a needle with a near-infinitely large haystack problem.</p>
<p>Of course, I did omit a major detail in my coarse overview of NP-Completeness above: provided you are fine with an <em>approximate</em> solution, many NP-Complete problems can be solved very nicely. However, the mathematical tools to solve problems approximately are often very different from the more intuitive (but impractical) techniques that lead to exact solutions. At the sight of the barbaric math involved, most traditional wet lab biologists will usually recoil in horror and tell you to call them when the algorithm that does it has been developed and implemented&#8230; Quite understandably, biologists are much happier handling their test tubes, <a href="http://unknowngenius.com/blog/archives/2009/04/09/mandatory-safety-practice-lecture-day/">cutting open innocent mice</a> and manning their cool expensive equipment to produce ton after ton of raw data: they didn&#8217;t sign up for the boring math and computer stuff that is increasingly necessary to interpret or use these data.</p>
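<p>As a rough illustration of what &#8220;approximate&#8221; means here, this is a Python sketch of about the simplest heuristic there is for the Traveling Salesman Problem: always hop to the nearest city not visited yet. It is only one of many possible heuristics, it comes with no guarantee on how far from the best tour it lands, and the distance table and function name are made up for the example.</p>

```python
def nearest_neighbour_tour(dist, start=0):
    """Greedy heuristic: from each city, go to the closest unvisited one."""
    n = len(dist)
    tour = [start]
    unvisited = set(range(n)) - {start}
    while unvisited:
        here = tour[-1]
        # Ties are broken by city number so the result is deterministic.
        nxt = min(unvisited, key=lambda city: (dist[here][city], city))
        tour.append(nxt)
        unvisited.remove(nxt)
    # Add up every leg, including the trip back to the starting city.
    length = sum(dist[a][b] for a, b in zip(tour, tour[1:] + [start]))
    return tour, length


# Hypothetical distances: dist[a][b] is the cost of going from city a to city b.
dist = [
    [0, 2, 9, 10],
    [1, 0, 6, 4],
    [15, 7, 0, 8],
    [6, 3, 12, 0],
]

print(nearest_neighbour_tour(dist))
```

<p>On this tiny map the greedy tour costs 33, while the best possible tour (0, 2, 3, 1 and back home) costs 21. That is the trade-off in a nutshell: the heuristic needs on the order of <i>n<small><sup>2</sup></small></i> steps instead of <i>2<small><sup>n</sup></small></i>, and in exchange you accept an answer that may be some way off. The serious approximation methods do much better than this one, with considerably scarier math.</p>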
<p>And this, ladies and gentlemen, is how a guy with a background in Math, Computer Science and Artificial Intelligence, who has not touched a microscope since high school, ends up working on protein sequences and drug interaction prediction.</p>
<p>Next week(*), I might even tell you how&#8230;</p>
<p>(*) margin of error: +/- 3 weeks.</p>
<p>Post originally published on: <a href="http://unknowngenius.com/blog">Dave's Blog</a> (please leave your comments over there)<br/><br/><a href="http://unknowngenius.com/blog/archives/2009/10/16/what-is-dave-doing-in-bioinformatics-pt-1/">What is Dave doing in Bioinformatics? Pt. 1</a></p>
]]></content:encoded>
			<wfw:commentRss>http://unknowngenius.com/blog/archives/2009/10/16/what-is-dave-doing-in-bioinformatics-pt-1/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>On being a traveling salesman of science&#8230;</title>
		<link>http://unknowngenius.com/blog/archives/2009/08/03/on-being-a-traveling-salesman-of-science/</link>
		<comments>http://unknowngenius.com/blog/archives/2009/08/03/on-being-a-traveling-salesman-of-science/#comments</comments>
		<pubDate>Sun, 02 Aug 2009 20:27:05 +0000</pubDate>
		<dc:creator>Dave</dc:creator>
				<category><![CDATA[Academic Adventures]]></category>
		<category><![CDATA[Bio-info]]></category>
		<category><![CDATA[Life of a Starving Genius]]></category>

		<guid isPermaLink="false">http://unknowngenius.com/blog/?p=2323</guid>
		<description><![CDATA[When my advisors (a combination of current and past ones) suggested that I get on the &#8220;2 weeks, 5 cities&#8221; tour, I was initially very excited. As it turns out, however, they were not talking about an all-expenses paid tour of Asia and America&#8217;s best nightlife spots. For mild entertainment and posterity value, a few [...]<p>Post originally published on: <a href="http://unknowngenius.com/blog">Dave's Blog</a> (please leave your comments over there)<br/><br/><a href="http://unknowngenius.com/blog/archives/2009/08/03/on-being-a-traveling-salesman-of-science/">On being a traveling salesman of science&#8230;</a></p>
]]></description>
			<content:encoded><![CDATA[<p>When my advisors (a combination of current and past ones) suggested that I get on the &#8220;2 weeks, 5 cities&#8221; tour, I was initially very excited. </p>
<p>As it turns out, however, they were <em>not</em> talking about an all-expenses paid tour of Asia and America&#8217;s best nightlife spots.</p>
<p>For mild entertainment and posterity value, a <s>few</s> frackload of random tidbits gleaned over the past 10 days and 25,000 miles (counting):</p>
<p>Boston is a nice city. Somewhat nicer than I imagined (it was perhaps one of the only major US cities I had never been to). At least in the middle of July, when the sun is warm and the rain has apparently stopped pouring, just in time for my arrival there. But weather concerns apart, it feels like one of a rare breed of US cities where you can live (fine) without a car. Which automatically puts it toward the top of my book. It also has lots of nice tree-lined avenues with cute little houses, and plenty of coffeeshops with semi-witty names and lovely US-style breakfasts (<em>baaaacon</em>&#8230;) that nearly make up for the filtered sock juice they call coffee&#8230;</p>
<p>Coincidentally, and with no bearing on the above statement of appreciation: Everybody in Boston is a 20-something upper-middle-class white person who only wears pastel polo shirts. Really: <em>everybody</em>. Even Asian people there are white. And they wear pastel polo shirts. On their way to one of the 259 Ivy League universities within walking distance of Fenway Park.</p>
<p><em>I am told</em> there are black people living in Boston too.</p>
<p><span id="more-2323"></span><br />
Upon confirmation, my sources insisted that they <em>really</em> meant African-American, not Indian (Indians are also white in Boston, but some of them do wear white short-sleeves instead of pastel polo shirts).</p>
<p>In between a lot of half-mumbled droning and powerpoint pr0n, centered around the life and death of proteins, <a href="http://en.wikipedia.org/wiki/Caenorhabditis_elegans">worms</a>, <a href="http://en.wikipedia.org/wiki/E_coli">projectile vomit-inducing bacteria</a> and many others, I did learn a couple cool things. </p>
<p>Did you guys know that the differences in some pathways of <a href="http://en.wikipedia.org/wiki/Circadian_cycle">circadian cycles</a>, from one individual to the next, have been scientifically linked to sleep dephasing issues? In Time Magazine-talk: Science has proved that there is such a thing as &#8220;morning-people&#8221; and &#8220;evening-people&#8221;. I know, it&#8217;s neat, isn&#8217;t it.</p>
<p>In fact, all that revolves around circadian cycles, and the incredible chemistry and mathematics involved in creating <em>multiple</em> near-perfect body clocks, each running on their own period, is quite awesome and fascinating.</p>
<p>That being said, scheduling a talk on circadian cycles in the middle of a hot Summer afternoon, following half a day of presentations and one heavy American lunch, is as close an attempt at humour as one may get in such places. </p>
<p>If you have no idea what that paragraph above was about, even after reading <a href="http://en.wikipedia.org/wiki/Circadian_cycle#Importance_in_animals">this</a>: congratulations on most likely going through High School without getting beaten up every week after PE.</p>
<p>Guess who the computer at Kansai International Airport selected as the lucky winner of a voucher for a once-in-a-lifetime, super-extra-fun, &#8220;additional security controls and full body check before boarding&#8221; prize.</p>
<p>Guess who, out of roughly 50 speakers presenting their work over the week, had the honour of giving the very last presentation of the last day of the conference, hot on the tail of a mere 8 hours of talks.</p>
<p>Hint: those were both the same person.</p>
<p>I am still not quite sure what miraculous circumstances were involved in yours truly not being booed off the stage of Boston University&#8217;s School of Biomedical Engineering Sciences. But the favoured hypothesis revolves around liberal use of the mesmerizing &#8220;blow up&#8221; transition effect in his Keynote presentation.</p>
<p>My one and only regret about the patent-pending &#8220;chocolate pretzel&#8221; finger food made available in the conference rec room: not bringing any back, so I could present them to the next person who dares criticise British food. Say all you want about the blandness of lamb in mint sauce: no sane Briton has ever willfully tried to make food by dipping a bread-like substance in salt, <em>then</em> chocolate. </p>
<p>I am still open to the idea that the whole thing was an industrial mistake (&#8220;Damn, we completely forgot to use <em>unsalted</em> pretzels for the chocolate batch&#8221;) unloaded at competitive prices to cost-conscious conference caterers.</p>
<p>Around the top of the List of People You Do Not Want As Your Neighbour on a 12-Hour Flight: 1) Morbidly obese North-American man with lax body hygiene standards. 2) Bratty 12-year-old kid on serious Ritalin withdrawal with no conception of personal space boundaries. 3) Japanese salaryman drowning his existential fear of death (aka looming retirement to a life at home with his wife) in mini-bottles of cheap red wine and lukewarm beer, drunk as fast as JAL hostesses will bring them, before taking on the task of becoming your best friend by asking you the same 5 questions a dozen times each.</p>
<p>The great thing about being on the 6-flights-in-7-days program is that it is perfectly possible to enjoy all of the above, and still have room for a couple boringly normal flights, merely spent trying to find the best way to fit your knees under your chin, while seriously suspecting your seat&#8217;s back of being <em>less</em> than 90° in so-called &#8220;upright&#8221; position.</p>
<p>Northwest sucks. <a href="http://www.keitaigoddess.com/">Tracey</a> (and Co) rules!</p>
<p>Good to know: when they screw up your domestic NRT-KIX connection and leave you stranded in a closing airport at 9pm, Northwest Airlines <em>will</em> give you the option of riding Shinkansen (the following day) at their expense. As for accommodation: you better have awesome friends living in Tokyo, or hope that the weather is warm above the sidewalks of Shinagawa station.</p>
<p>Also good to know: JR counter guys will not, unless specifically asked, give you a receipt for your overpriced 16,000 yen <a href="http://en.wikipedia.org/wiki/Nozomi_(train)">Nozomi</a> + Airport shuttle ticket. You will usually realise that somewhere halfway between Tokyo and Osaka, shortly before you also realise that the last ticket gate eats <em>all</em> your tickets, leaving you with absolutely no proof whatsoever that you ever rode shinkansen on that day, and a very useless Northwest refund claim form.</p>
<p>To my lovely Bostonian who described Boston Summer as an object of fear, on account of unbearable heat and stickiness: do not visit Kyoto, Tokyo or Singapore during the Summer. Ever.</p>
<p>Oddly enough, the extra couple degrees of Singapore weather aren&#8217;t so tough, when experienced from a luxury condo overlooking the residence&#8217;s 24h olympic swimming pool and door-to-door taxi service to the convention center downtown. I always knew it would pay off one day, sharing a couple school years with the future masters of the finance world (the smart ones: those who still have a job today).</p>
<p>Running one of the largest worldwide conferences in your domain and requiring from your speakers that they run their slides exclusively through &#8220;the presentation rooms&#8217; laptop [...] equipped with Windows XP SP3 and Microsoft PowerPoint&#8221;&#8230; makes me both extremely irritated and smugly satisfied that I have moved on to another domain ever since.</p>
<p>For no good reason I can think of, I am absolutely, utterly, in love with Singlish&#8230; &#8220;<em>And if these not fit, you bring them back tomorrow lah. Can can lah!</em>&#8221;</p>
]]></content:encoded>
			<wfw:commentRss>http://unknowngenius.com/blog/archives/2009/08/03/on-being-a-traveling-salesman-of-science/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Abstract Art or Science?</title>
		<link>http://unknowngenius.com/blog/archives/2009/06/09/abstract-art-or-science/</link>
		<comments>http://unknowngenius.com/blog/archives/2009/06/09/abstract-art-or-science/#comments</comments>
		<pubDate>Tue, 09 Jun 2009 05:28:08 +0000</pubDate>
		<dc:creator>Dave</dc:creator>
				<category><![CDATA[Academic Adventures]]></category>
		<category><![CDATA[Bio-info]]></category>
		<category><![CDATA[Quickies]]></category>

		<guid isPermaLink="false">http://unknowngenius.com/blog/?p=2228</guid>
		<description><![CDATA[You decide&#8230; Insulin I &#8211; by Rattus Norvegicus Post originally published on: Dave's Blog (please leave your comments over there)Abstract Art or Science?<p>Post originally published on: <a href="http://unknowngenius.com/blog">Dave's Blog</a> (please leave your comments over there)<br/><br/><a href="http://unknowngenius.com/blog/archives/2009/06/09/abstract-art-or-science/">Abstract Art or Science?</a></p>
]]></description>
			<content:encoded><![CDATA[<p>You decide&#8230;</p>
<p><a href="http://unknowngenius.com/blog/wp-content/uploads/2009/06/insulin_rat.png"><img src="http://unknowngenius.com/blog/wp-content/uploads/2009/06/insulin_rat-1023x278.png" alt="Insulin I - by Rattus Norvegicus" title="Insulin I - by Rattus Norvegicus" width="511" height="139" class="size-large wp-image-2229" /></a></p>
<p><i>Insulin I</i> &#8211; <a href="http://www.ncbi.nlm.nih.gov/protein/9506815?report=genpept&#038;log$=seqview">by Rattus Norvegicus</a></p>
]]></content:encoded>
			<wfw:commentRss>http://unknowngenius.com/blog/archives/2009/06/09/abstract-art-or-science/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Save the fastest running gene in the galaxy!</title>
		<link>http://unknowngenius.com/blog/archives/2009/03/31/save-the-fastest-running-gene-in-the-galaxy/</link>
		<comments>http://unknowngenius.com/blog/archives/2009/03/31/save-the-fastest-running-gene-in-the-galaxy/#comments</comments>
		<pubDate>Tue, 31 Mar 2009 09:11:09 +0000</pubDate>
		<dc:creator>Dave</dc:creator>
				<category><![CDATA[Bio-info]]></category>
		<category><![CDATA[Procrastination]]></category>

		<guid isPermaLink="false">http://unknowngenius.com/blog/?p=2065</guid>
		<description><![CDATA[If, like me, you delight in advance at the possibility of one day being diagnosed with an incurable disease linked to a gene named after the world&#8217;s most famous blue hedgehog, feel free to circulate the following petition: Dear fellows at the HUGO Gene Nomenclature Guidelines Committee, It has recently come to my attention that [...]<p>Post originally published on: <a href="http://unknowngenius.com/blog">Dave's Blog</a> (please leave your comments over there)<br/><br/><a href="http://unknowngenius.com/blog/archives/2009/03/31/save-the-fastest-running-gene-in-the-galaxy/">Save the fastest running gene in the galaxy!</a></p>
]]></description>
			<content:encoded><![CDATA[<p>If, like me, you delight in advance at the possibility of one day being diagnosed with an incurable disease linked to a gene named after the world&#8217;s <a href="http://en.wikipedia.org/wiki/Sonic_the_Hedgehog_(character)">most famous blue hedgehog</a>, feel free to circulate the following petition:</p>
<blockquote><p>
Dear fellows at the <a href="http://www.genenames.org/">HUGO Gene Nomenclature Guidelines Committee</a>,</p>
<p>It has recently come to my attention that <a href="http://www.genenames.org/cgi-bin/hgnc_hot_topic.pl">you have decided</a> to do away with names deemed &#8220;inappropriate or offensive&#8221; found in the existing international gene nomenclature. </p>
<p>I couldn&#8217;t agree more: I always thought that the officially registered name for gene <i>kill-all-the-Jews-and-drink-their-blood</i> was a bit politicised for a scientific setting.</p>
<p>Whatever you do, however, keep your hands off gene <a href="http://www.genenames.org/data/hgnc_data.php?hgnc_id=HGNC:10848"><b>SHH</b></a>, otherwise officially known as <a href="http://www.ncbi.nlm.nih.gov/UniGene/clust.cgi?ORG=Bt&#038;CID=13121"><i><strong>sonic hedgehog</strong> homolog (Drosophila)</i></a> gene. We like it the way it is (blue, spiky and running very fast). Besides, what better way to break the news of some potentially fatal gene mutation disease to a kid, than by introducing a beloved computer game mascot? </p>
<p>PS: and for chrissake, drop the <a href="http://www.genenames.org/cgi-bin/hgnc_hot_topic.pl">Comic Sans font</a>: it makes your world-class gene database website look like it was coded by a 1st year CS student in 1991 (yes, I know: it probably was).</p>
<p>Love,
</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://unknowngenius.com/blog/archives/2009/03/31/save-the-fastest-running-gene-in-the-galaxy/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
