Glenn Morton (Glenn.Morton@oryx-usa.com)
Mon, 29 Jun 1998 10:58:22 -0500

On Sun, 28 Jun 1998 11:03:49 -0700 (PDT), Greg Billock wrote:

>Applying information theory to biology is a bit more realistic, because
>we have lots of examples of 'messages' and can at least make progress
>in detecting their relationships (although the 'code' doesn't seem very
>clear as of yet). A main problem is that nobody really knows how much
>information content is in DNA. If someone tells you there are 2*length
>bits there, you can feel free to say they are full of it; there is
>no way all possible DNA sequences can be associated with living
>organisms. A fraction so small as to be barely detectable on the
>total scale has the chance of being associated with a live organism.

I have a question. I don't want to miss something here. It seems to me that the code we need to deal with is the DNA code. We know how many characters are in the DNA language, 4, and given any length of DNA, we can calculate exactly how many permutations there are. We can't tell how many of those permutations would create a living being, of course (and I agree with you that the percentage of all possible sequences that would create a living being is quite small indeed), but we do know precisely how many permutations there are in any given sequence. So, what exactly do we not know which prevents us from using H = -k * sum over i of p[i]*log(p[i]) to calculate the informational content of a given string of DNA? We know k, we know the p[i]'s, and we understand logarithms, and that is all we need to know, isn't it?
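To make the calculation I have in mind concrete, here is a minimal sketch in Python. It rests on assumptions of mine that are not settled anywhere in this thread: k = 1 with base-2 logarithms (so H comes out in bits per base), and the p[i] estimated from the base frequencies observed in the string itself. The function names and the example string are made up for illustration.

```python
import math
from collections import Counter

def sequence_count(length, alphabet_size=4):
    """Number of distinct sequences of this length: alphabet_size ** length."""
    return alphabet_size ** length

def shannon_entropy_per_base(seq):
    """H = -sum(p[i] * log2(p[i])) in bits per base.

    Assumption for illustration: p[i] is the observed frequency of
    base i in seq, and k = 1 with logarithms taken to base 2.
    """
    counts = Counter(seq)
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

example = "ACGTACGTAAAA"  # made-up string, not real genomic data
print(sequence_count(len(example)))       # how many strings of this length exist
print(shannon_entropy_per_base(example))  # bits per base under the assumptions above
```

Note that the answer depends entirely on where the p[i] come from. Estimating them from single-base frequencies is only one possible choice, and it ignores any dependence between neighboring bases.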

>Nobody knows what that fraction is, or even where exactly the most
>crucial parts are. So similar to the "0" message, it is a hard
>problem to try to figure out what information theory might do for us
>in biology.

The only way I see that the '0' message applies to DNA is if you are suggesting that there are other nucleotides of which we are unaware. As you suggested with the '0' message, "Since there is basically no way to detect what a one-shot information source could have done, there is no way to model the distribution of its possible messages, and consequently no way to figure out how much information was gotten. Was I restricting myself to {0, 1}? to numerals? to one-digit keyboard taps? with some weird probability distribution on them? unrestricted length?"

But with DNA, we have only 4 nucleotides and thus KNOW what we are restricted to. This should make the information content calculable. What am I missing here?
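As a rough illustration of what knowing the alphabet does and does not settle, the sketch below compares four equally likely nucleotides with a skewed distribution. The skewed probabilities and the length of 1000 are hypothetical numbers I picked for the example, and the total assumes each position is independent of the others.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution (zero terms skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]  # all four nucleotides equally likely
skewed = [0.7, 0.1, 0.1, 0.1]       # hypothetical, chosen only for contrast

length = 1000  # hypothetical sequence length
print(entropy_bits(uniform) * length)  # the 2*length upper bound, in bits
print(entropy_bits(skewed) * length)   # fewer bits for any non-uniform distribution
```

With a four-letter alphabet the most we can get is 2 bits per base, which is the 2*length figure. Any other distribution over the same four letters gives less, so fixing the alphabet fixes the ceiling but not the value.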

>Even in more behavioral biology, it is hard. Is the
>brain trying to maximize information somehow? If so, where? Do
>species try to maximize genetic information somehow? The problem with
>these scenarios is that they invariably predict utter randomness, since
>that generates the most information, so people are still puzzling about
>how to restrict the channel models so as to better apply it. Who knows,
>perhaps someday a lot will come of it.

If you are talking about the continuity of the DNA message through time, that is an interesting issue given Haldane's dilemma. There is a solution to it, but I am at work and can't look it up, and I can't recall the details off the top of my head.