Glenn R. Morton (grmorton@waymark.net)
Tue, 23 Jun 1998 21:23:12 -0500

At 10:17 PM 6/23/98 +0800, Brad Jones wrote:

>------------------------------------------------------------
>Glenn,
>
>I think there are two major problems with your reasoning.
>
>1. You are assuming that the information content of DNA is best
>modeled as an information source. As I will show below, this is not the
>case, and by doing so you will not produce any informative results.
>
>2. Your analysis of DNA as an information source is also incorrect.
>
>The main mistake that you made was to assume that DNA is a "zero
>memory" source. A zero memory source outputs symbols that do not
>depend on the previous symbols; this is not the case with DNA.
>
>A DNA sequence of AAAAATAAAA will output this each and every
>time eg: AAAAATAAAA AAAAATAAAA AAAAATAAAA
>AAAAATAAAA
>
>A zero memory source with the probabilities given would produce
>something like: AATAAAAAAA AAAAATAAAA AAAAAAATAA
>ATAAATAAAA

I believe that you are mixing up how memory works in such systems. "Zero
memory" describes a source, such as a zeroth-order Markov chain, in which
the previous character plays no part in determining the next. 'Memory'
does not refer to the entire sequence being repeated. In English, vowels
are more likely to appear after a consonant, so English is not a
zero-memory system: the previous character influences the next. For
example, q is almost always followed by u in English, so when q appears
the choice of the next character is almost entirely constrained to u.
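
To make the distinction concrete, here is a minimal Python sketch that
compares the entropy of a text when each character is treated as
independent (zero memory) with the entropy when the previous character is
known (a first-order Markov model). The sample sentence and the function
names are illustrative assumptions, not taken from any reference.

```python
from collections import Counter
from math import log2


def entropy(counts):
    """Shannon entropy in bits of the distribution described by a Counter."""
    total = sum(counts.values())
    return sum((n / total) * log2(total / n) for n in counts.values())


def zero_memory_entropy(text):
    """Bits per character when every character is treated as independent."""
    return entropy(Counter(text))


def first_order_entropy(text):
    """Bits per character given the previous one: H(next | previous)."""
    pairs = Counter(zip(text, text[1:]))
    previous = Counter(text[:-1])
    # Chain rule: H(next | previous) = H(previous, next) - H(previous)
    return entropy(pairs) - entropy(previous)


# Hypothetical sample text, chosen only because it is rich in "qu".
sample = "the quick queen quietly quit the quiz"
print(round(zero_memory_entropy(sample), 2))
print(round(first_order_entropy(sample), 2))
```

The second figure comes out lower than the first because knowing the
previous character (a q, say) narrows down the next one. For a true
zero-memory source, and given enough data, the two figures would be
about equal.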

>
>If you would like this mathematically then the analysis is as follows:
>
>*note that log to base 2 is used to give the result in bits*
>
>sequence: AAAAATAAAA AAAAATAAAA AAAAATAAAA
>AAAAATAAAA
>zero memory:
>P(A) = 0.9, P(T) = 0.1
>H(S)= sum( p(i)*log(1/p(i)) )
> =0.47 bits
>
>2nd extension:
>using groups of two: AA AA AT AA AA ...
>P(AA) = 4/5, P(AT) = 1/5
>H(S^2) = 0.72 bits per pair
>H(S) = H(S^2)/2 = 0.36 bits per symbol
>
>The 2nd extension has a lower information content than the zero
>memory; this indicates that the source is not a zero memory source.
>
>You can extend this by taking the 3rd, 4th etc. and the information
>content will continue to drop. Eventually the information rate will
>drop to zero, as the 10th extension shows:
>
>AAAAATAAAA AAAAATAAAA AAAAATAAAA
>AAAAATAAAA
>P(AAAAATAAAA) = 1
>H(S) = 0
>

You are using a 10th-order Markov chain, and that is not what DNA is. Brian,
would you care to comment on this?
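
For readers who want to check the quoted arithmetic, the following Python
sketch reproduces the three figures (0.47, 0.36 and 0 bits per symbol) for
the repeated sequence. It assumes the sequence is cut into non-overlapping
blocks starting at the first symbol, as in the quoted example; the function
name is hypothetical.

```python
from collections import Counter
from math import log2


def block_entropy_per_symbol(sequence, n):
    """Entropy per symbol (bits) from non-overlapping blocks of length n."""
    blocks = [sequence[i:i + n] for i in range(0, len(sequence) - n + 1, n)]
    counts = Counter(blocks)
    total = len(blocks)
    # H(S^n) = sum over blocks of p * log2(1/p), with p = count / total
    h_block = sum((c / total) * log2(total / c) for c in counts.values())
    return h_block / n


sequence = "AAAAATAAAA" * 4
for n in (1, 2, 10):
    print(n, round(block_entropy_per_symbol(sequence, n), 2))
# prints:
# 1 0.47
# 2 0.36
# 10 0.0
```

The figures fall to zero only because this particular sequence repeats
exactly every 10 symbols and the blocks are aligned with that period.
Whether DNA behaves like that is the point in dispute.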

>From this it would be concluded that the source is actually a Markov
>source, not a zero memory source.

A zero-memory Markov chain is exactly what a random sequence is: each
symbol is independent of the ones before it.
>
>The general rule is that any information source that repeats a
>sequence will have zero information. In plain English this is:
>
>"If we already know what is being sent then there is no information
>being sent"
>
>btw this is the basic principle that all compression software relies on.
>
>So you can see that modeling DNA as an information source tells us
>nothing at all.
>
>The method I would use to model DNA is as an information channel.
>This is due to DNA being equivalent to a storage device for
>information (exactly like a CD, etc.). It can be played as many times as
>you like, but it is not _creating_ any information each time it is played.

Maybe you should tell this to Hubert Yockey. But I don't think he would
agree with you. By the way, DNA is not like a CD. There are mutations in
DNA and they can add information.
>
>The mutations of DNA seem analogous to the errors encountered
>when copying a CD which is quite easily modeled by a correct
>application of information theory.
>
>By doing it this way it is possible to model the random mutations and
>the effect they have on the information, ie the difference they make to
>the information content as opposed to the actual information content.
>The measure of this is called the mutual information of a channel.
>
>I hope this clears it up somewhat. It is quite difficult to explain this in
>easy terms, and I would recommend finding a good textbook if you
>really want to pursue this.

I was about to make the same recommendation to you.

glenn