**From:** Iain Strachan (*iain.strachan@eudoramail.com*)

**Date:** Sat Dec 09 2000 - 21:59:57 EST


I wrote:

> I really can't (sorry, Glenn), see any problem with this, and how
> one can go from this to saying that the method has failed because
> the researchers had to be told it was designed.

Glenn replied:

> Let me try in another manner. If I correlate two sequences and get a
> correlation coefficient of .9, my personal state of knowledge isn't
> what is telling me that the sequences are quite similar. The math is
> telling me. If I run a Fourier transform and find that the highest
> periodicity of a sequence is 50 cycles/second, then it isn't my
> personal state of knowledge which tells me that. It is the math.
> These two procedures allow conclusions to be made without any
> reference to the personal state of knowledge I have prior to running
> the programs. They are objective.

Two points:

(1) The effectiveness of the Fourier transform analysis arose from your observation that you had a sequence with periodicity, and from your personal knowledge that a periodic sequence can be analysed as a Fourier series. Personally, I see no conceptual difference between spotting a Fourier series (i.e. a periodic function) and spotting a sequence derived from primes. Both are pieces of mathematics that you had to know in order to make the deduction.
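To make that concrete, here is a minimal sketch (my own illustration in Python/NumPy, not Glenn's actual analysis; the 50 Hz component and 1000 Hz sample rate are assumed values) of what "the math telling you" the periodicity looks like. Note that the answer only comes out because we already chose to ask a Fourier-shaped question:

```python
import numpy as np

# Illustrative signal (assumed, not Glenn's data): one second sampled
# at 1000 Hz, dominated by a 50-cycles/second sine plus noise.
fs = 1000
t = np.arange(fs) / fs
rng = np.random.default_rng(0)
signal = np.sin(2 * np.pi * 50 * t) + 0.5 * rng.standard_normal(fs)

# The FFT duly reports the dominant periodicity as ~50 Hz...
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(fs, d=1 / fs)
print(freqs[np.argmax(spectrum[1:]) + 1])  # ~50.0

# ...but only because we already knew to model the data as a periodic
# function of time, and to skip the zero-frequency (mean) term.
```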

(2) The point about the correlation coefficient is more interesting, because it precisely illustrates the point that you do need inside knowledge, and that you can't rely on some "objective math" formula that allows you to crank the handle and churn out meaningful results. First, let me quote from a standard textbook on numerical analysis (Numerical Recipes in C), talking about the correlation coefficient:

"When a correlation is _known to be significant_ [emphasis mine], R

is one conventional way of summarizing its strength. In fact, the

value of R can be translated into a statement about what residuals

(root mean square deviations) are to be expected if the data are

fitted to a straight line by the least-squares method [ref to

equations skipped] ... Unfortunately, R is a rather poor statistic

for deciding _whether_ [emphasis in the original] an observed

correlation is significant, and/or whether one observed correlation

is significantly stronger than another. The reason is that R is

ignorant of the individual distributions of x and y, so there is no

universal way to compute its distribution in the case of a null

hypothesis" [Press, Teukolsky, Vetterling & Flannery: "Numerical

Recipes in C", Second Edition, Cambridge University Press, 1992,

p636].

So, what are Press et al. saying here? That the correlation coefficient R is pretty meaningless unless you already know that the data are correlated. This is precisely your objection to Dembski (he can't detect design unless you tell him it's designed, via the "side information"). Does this mean that the correlation coefficient is a totally useless statistic? Not at all. They go on to discuss assumptions about the general shape of the distributions of x and y (concerning the fall-off rate of the tails of the distributions) that allow one to derive a meaningful distribution for R. What it comes down to is that if your data, when plotted on an X/Y scatter plot, look a bit like a long thin ellipse, then you have good reason to suspect that they are correlated, and from that you can get meaningful results by comparing values of R. So you have to use your intelligence and prior knowledge of what correlated variables look like in order to use the correlation coefficient.

To see just how meaningless the results get if you just put the numbers into the formula and crank out the result, consider the following experiment, which you can easily perform in Microsoft Excel. Generate 100 (x,y) pairs from random numbers in the range 0-1 (this can be done with the Excel RAND() function). Add a 101st (x,y) pair and make it equal to (100,100). Now compute the correlation coefficient between the two sequences (using the Excel CORREL() function). You will get an answer for R that is close to 0.999. So your "objective math" is telling you that the sequences are highly correlated.

But something tells me that these sequences are not highly correlated. What do you think it is? It's my inside knowledge of what correlated data ought to look like. That tells me that the (100,100) point is a massive outlier and should be discarded (whereupon R drops to around 0.01).
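If you'd rather not fire up Excel, the same experiment takes a few lines of Python (my sketch; NumPy's corrcoef here plays the role of Excel's CORREL()):

```python
import numpy as np

rng = np.random.default_rng(0)

# 100 (x, y) pairs of independent uniform random numbers in [0, 1],
# the equivalent of filling two Excel columns with RAND().
x = rng.random(100)
y = rng.random(100)

# Uncorrelated noise: R is near zero.
print(np.corrcoef(x, y)[0, 1])

# Append the 101st pair, (100, 100) -- a single massive outlier.
x_out = np.append(x, 100.0)
y_out = np.append(y, 100.0)

# The "objective math" now reports R close to 0.999.
print(np.corrcoef(x_out, y_out)[0, 1])
```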

Is this a silly example that wouldn't occur in real life? I've seen a lot worse than that. In the first neural nets application I worked on (which ended up as a successfully deployed analysis tool), I was using a neural net to predict plasma electron density profiles inside a fusion experiment (the JET vacuum vessel). The electron densities were of the order of 10^20 per cubic metre. However, the data file I received had a few electron densities of the order of 10^76 per cubic metre. My background knowledge of physics told me that you just don't get electron densities of 10^76 per cubic metre in a vacuum vessel (or anywhere else, for that matter ;-). I therefore concluded that these were due to a processing error in the computer program that gave me the file of data, and discarded the offending items. If I'd naively shoved it all into the neural net, it would have ended up predicting everything in the region of 10^75 - 10^76, and the results would have been completely useless.
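In code, that clean-up step is nothing more sophisticated than a physically motivated cut applied before training. A minimal sketch (the 10^25 per cubic metre bound is my illustrative choice, comfortably above any plausible density in the vessel, and not a value from the actual JET work):

```python
import numpy as np

def discard_implausible(densities, upper_bound=1e25):
    """Drop samples whose electron density exceeds a physically
    plausible bound -- background knowledge the math alone lacks."""
    densities = np.asarray(densities)
    mask = densities < upper_bound
    return densities[mask], int(np.count_nonzero(~mask))

# Realistic values near 1e20 m^-3, plus two corrupted entries near 1e76.
data = [2.1e20, 3.4e20, 1.9e20, 7.3e76, 2.8e20, 1.1e76]
clean, n_dropped = discard_implausible(data)
print(clean, n_dropped)  # the 1e76 entries are gone; 2 items dropped
```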

The moral of the story is that you can't make any statistical inference (whether it's correlation, pattern detection, or "design") just by blindly plugging your data into some formula, and relying on the maths to tell you the answer. You have to use your background knowledge if it's not to be "Lies, damned lies and statistics".

That is why I don't believe your objection to Dembski's use of "side information" is a valid one. There may be other reasons for criticizing Dembski, but this isn't one of them.

I have argued, further, that I believe Dembski's methodology does have a mathematical procedure, one that can be understood in terms of the minimum description length principle, and that this framework can be applied to numerical and non-numerical data. While this appears to be separate from what Dembski writes, it is essentially the same idea as his discussion of the compressibility of bit strings in terms of Chaitin/Kolmogorov/Solomonoff "Algorithmic Information Theory", discussed in detail in Section 2.4 of "No Free Lunch" (p. 58ff).
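To give a feel for the connection (my own illustration, not Dembski's procedure, which he develops analytically): a crude, computable stand-in for a bit string's Kolmogorov complexity is its length under an off-the-shelf compressor. Regular, "specified" strings compress dramatically; random ones barely compress at all:

```python
import zlib
import random

def compressed_len(bits: str) -> int:
    """Bytes needed by zlib to encode the bit string -- a rough upper
    bound on its algorithmic (Kolmogorov) description length."""
    return len(zlib.compress(bits.encode("ascii")))

periodic = "01" * 500  # short description exists: "repeat '01' 500 times"
random.seed(0)
noise = "".join(random.choice("01") for _ in range(1000))  # no short description

print(compressed_len(periodic))  # small -- the regularity is detected
print(compressed_len(noise))     # much larger -- close to incompressible
```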

Apologies for the long delay in responding to this. Other things intervened.

Regards,

Iain.

